<center><h1>Python Pandas Tutorial</h1></center>

![Pandas](https://pandas.pydata.org/_static/pandas_logo.png)

## pandas is the Python Data Analysis Library

pandas is an open source, BSD-licensed (so commercial use is allowed) library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

* Widely used
* Open Source
* Active Development
* Great Documentation

Home Page: http://pandas.pydata.org/

Using Documentation from: http://pandas.pydata.org/pandas-docs/stable/

Fantastic Cheat Sheet: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Best book, by pandas' creator Wes McKinney (2nd Edition 2017): http://shop.oreilly.com/product/0636920050896.do

In [None]:
import pandas as pd

In [None]:
# pandas is a big package, so the import can take a while...

In [None]:
import numpy as np # another big library with various numeric functions

In [None]:
import matplotlib.pyplot as plt

# pandas' two fundamental data structures: Series and DataFrame

### Series
A Series is a one-dimensional array-like object containing a sequence of values
(of types similar to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from just an array of data.

In [None]:
# Let's create some Series!

In [None]:
s = pd.Series([1,4,3.5,3,np.nan,0,-5])
s

In [None]:
s+4

In [None]:
s2 = s * 4 
s2

In [None]:
s2**2

In [None]:
s3 = pd.Series(range(20))
s3

In [None]:
s3.describe()

In [None]:
s3.count()

In [None]:
s3.mean()

In [None]:
s.hist()

In [None]:
s4 = s3**1.5
s4

In [None]:
s4[3]

In [None]:
s4[:10]

In [None]:
s4.plot(kind="line", grid=True)

In [None]:
### Often you want Series with an index identifying each data point with a label 

In [None]:
labeledSeries = pd.Series([24, 77, -35, 31, 66], index=['d', 'e', 'a', 'g', 'g'])
labeledSeries

In [None]:
## A bit similar to a dictionary, isn't it?


In [None]:
labeledSeries['d']

In [None]:
labeledSeries['g']

In [None]:
labeledSeries[:2]

In [None]:
labeledSeries[::-1]

In [None]:
type(labeledSeries.values)

In [None]:
labeledSeries.index

In [None]:
type(labeledSeries.index)

In [None]:
labeledSeries.values

In [None]:
# Access multiple values by passing a list of index labels
labeledSeries[['a','d']] # NOTE double list brackets!!

In [None]:
labeledSeries > 30

In [None]:
labeledSeries[labeledSeries > 30]

In [None]:
# So Series is a fixed-length, ordered dictionary with extra helper methods

In [None]:
'd' in labeledSeries

In [None]:
77 in labeledSeries

In [None]:
77 in labeledSeries.values

In [None]:
mylist = list(s4)

In [None]:
mylist

In [None]:
lablist = list(labeledSeries)
lablist

In [None]:
# You can create a Series from a dictionary by passing it to the constructor: pd.Series(mydict)

In [None]:
citydict = {'Riga': 650000, 'Tukums':20000, 'Ogre': 25000, 'Carnikava': 3000}
citydict

In [None]:
cseries = pd.Series(citydict)
cseries

In [None]:
## Overwriting default index
clist = ['Jurmala', 'Riga', 'Tukums', 'Ogre', 'Daugavpils']
cseries2 = pd.Series(citydict, index = clist)
cseries2

In [None]:
cseries2[cseries2.isnull()] = cseries.mean()
cseries2

In [None]:
# notice Carnikava was lost, since our index did not have it!
# and order was preserved from the given index list!

In [None]:
# For missing data
myfilter = cseries2.isnull()
myfilter

In [None]:
# series.mean is a method/function not a value!
mymean = cseries2.mean()
mymean

In [None]:

cseries2[myfilter] = cseries2.mean()
cseries2

In [None]:
cseries3 = cseries + cseries2
cseries3

In [None]:
# Addition aligns on index labels: a label missing from either Series gives NaN (NaN + number = NaN)
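
The NaN values in the sum above come from index alignment: a label that appears in only one of the two Series has nothing to be added to. Below is a minimal sketch with two small hypothetical Series (`a` and `b` are illustration names, not part of the tutorial data). It also shows `Series.add` with `fill_value`, which treats a label missing on one side as 0.

```python
import pandas as pd

# Two small Series that share only the 'Riga' label
a = pd.Series({'Riga': 650000, 'Ogre': 25000})
b = pd.Series({'Riga': 650000, 'Tukums': 20000})

# Plain addition: labels present on only one side become NaN
plain = a + b

# add() with fill_value=0 substitutes 0 for the missing side
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible substitute depends on what the numbers mean, so treat `fill_value` as a deliberate choice rather than a default.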

In [None]:
cseries.name = "Latvian Cities"
cseries.index.name = "City"
cseries

In [None]:
cseries.index

In [None]:
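# Sort by label first: recent pandas keeps dict insertion order, so the new names would otherwise be mismatched
cseries = cseries.sort_index()
cseries.index = ['CarnikavaIsNotaCity','OgreEatsHumans', 'RigaIsOld', 'TukumsSmukums']
cseries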

In [None]:
# Series values are mutable
cseries['RigaIsOld']=625000
cseries

In [None]:
# How to rename individual index elements?
# Direct assignment raises a TypeError because Index objects are immutable
cseries.index[2]='RigaIsOldButFantastic'
cseries

In [None]:
# Use the rename method to rename individual index labels

In [None]:
# Limitation: range only works with integers
series6 = pd.Series(range(1,10,2))
series6

In [None]:
# np.arange is more flexible and allows use of floats
series5= pd.Series(np.arange(1,10.5,0.5))
series5

In [None]:
# With np.linspace you set the start, the end (which is included!) and the total number of values
newseries = pd.Series(np.linspace(1,10,19))
newseries
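
As a quick check of the two comments above, this sketch builds the same values both ways. The names `stepped` and `counted` are only for illustration.

```python
import numpy as np
import pandas as pd

# arange: start, stop (excluded), step; a stop of 10.5 keeps 10.0 as the last value
stepped = pd.Series(np.arange(1, 10.5, 0.5))

# linspace: start, stop (included), number of points
counted = pd.Series(np.linspace(1, 10, 19))

# Both describe the same 19 evenly spaced values
same = np.allclose(stepped, counted)
print(len(stepped), len(counted), same)
```

With float steps, `np.arange` can be sensitive to rounding at the end point, so `np.linspace` is usually the safer choice when the number of points matters.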

In [None]:
cseries.rename(index={'RigaIsOld':'RigaRocks','OgreEatsHumans':'OgreEatsTastyHumans'})

In [None]:
cseries[-1]

### Integer Indexes

Working with pandas objects indexed by integers is something that often trips up
new users due to some differences with indexing semantics on built-in Python data
structures like lists and tuples. For example, you might not expect the following code
to generate an error:



In [None]:
ser = pd.Series(np.arange(3.))
ser
ser[-1]  # raises KeyError: -1 is not one of the labels 0, 1, 2

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in
general without introducing subtle bugs. Here we have an index containing 0, 1, 2,
but inferring what the user wants (label-based indexing or position-based) is difficult:


In [None]:
ser

In [None]:
## With a non-integer index there is no potential for ambiguity:

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

In [None]:
ser2[::-1]

In [None]:
## To keep things consistent, if you have an axis index containing integers, data selection
## will always be label-oriented. For more precise handling, use loc (for labels) or iloc
## (for integers):
ser[:2]

In [None]:
ser.loc[:1]

In [None]:
len(ser)

In [None]:
ser.iloc[:1]

* loc gets rows (or columns) with particular labels from the index.

* iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
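
The difference between the two bullets above is easiest to see on a Series whose labels are themselves integers. A minimal sketch, reusing the `ser` definition from earlier (`by_label` and `by_position` are illustration names):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # labels 0, 1, 2 and values 0.0, 1.0, 2.0

# loc slices by label and includes the end label
by_label = ser.loc[:1]

# iloc slices by position and excludes the end position, like a Python list
by_position = ser.iloc[:1]

# iloc also accepts negative positions, which plain ser[-1] does not
last = ser.iloc[-1]

print(len(by_label), len(by_position), last)
```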

# Date Range creation

In [None]:
dates = pd.date_range('20190528', periods=15)
dates

In [None]:
from datetime import date
date.today().strftime("%Y%m%d%W")

In [None]:
type(date.today())

In [None]:
weeks = pd.date_range(date.today().strftime("%Y%m%d"), periods = 10, freq='W-TUE')
weeks

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

The DataFrame has both a row and a column index. Think of it as a dict of Series all sharing the same index.

Under the hood, the data is stored as one or more two-dimensional blocks (similar to ndarray) rather than as a list, dict, or some other collection of one-dimensional arrays.
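
To make the "dict of Series sharing the same index" picture concrete, here is a minimal sketch. The two Series reuse values from this tutorial's city data; the variable names are only for illustration.

```python
import pandas as pd

# Two Series with the same index labels but different value types
popul = pd.Series({'Riga': 0.62, 'Jurmala': 0.06})
year = pd.Series({'Riga': 2018, 'Jurmala': 2003})

# Each dict key becomes a column; the Series index becomes the row index
frame = pd.DataFrame({'popul': popul, 'year': year})

# Pulling a column back out gives a Series on the shared index
column = frame['year']

print(frame)
print(frame.dtypes)
```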

In [None]:
# There are many ways to create a DataFrame
# One common way is
# from a dict of equal-length lists or NumPy arrays

In [None]:
data = {'city': ['Riga', 'Riga', 'Riga', 'Jurmala', 'Jurmala', 'Jurmala'],
'year': [1990, 2000, 2018, 2001, 2002, 2003],
'popul': [0.9, 0.75, 0.62, 0.09, 0.08, 0.06]}
df = pd.DataFrame(data)
df

In [None]:
data = {'city': ['Riga', 'Riga', 'Riga', 'Jurmala', 'Jurmala', 'Jurmala'],
'year': [1990, 2000, 2018, 2001, 2002, 2003],
'popul': [0.9, 0.75, 0.62, 0.09, np.nan, 0.06]}
df = pd.DataFrame(data)
df

In [None]:
listoflist = [list(range(n,n+10)) for n in range(5)]
listoflist

In [None]:
dflist = pd.DataFrame(listoflist)
dflist

In [None]:
dflist - 2

In [None]:
data

In [None]:
df2 = pd.DataFrame(data, columns=['year','city', 'popul','budget'])
df2

In [None]:
df2["mayor"] = False
df2

In [None]:
df2.loc[2:5,'mayor'] = "Nils"
df2

In [None]:
df2['mayor'] = False
df2

In [None]:
df2.loc[df2['city'] == "Riga", 'mayor'] = 'Nils'
df2


In [None]:
# A column that is missing from the data is simply filled with NaN

In [None]:
df2['budget']=300000000
df2

In [None]:
df2['budget']=[300000, 250000, 400000, 200000, 250000, 200000] # need to pass all values
df2

In [None]:
# Many ways of changing individual values

## Recommended way of changing in place (same dataframe)



In [None]:
df2.iat[3,2]=0.063
df2

In [None]:
del df2['mayor']
df2

In [None]:
# delete column by its name
del df2['budget']
df2

In [None]:
dates

In [None]:
df = pd.DataFrame(np.random.randn(15,5), index=dates, columns=list('ABCDE'))
# We passed 15 rows of 5 random elements and set index to dates and columns to our basic list elements
df

In [None]:
df.count()

In [None]:
df.describe()

In [None]:
df['A'].hist()

In [None]:
df2 = pd.DataFrame({ 'A' : 1.,
                      'B' : pd.Timestamp('20130102'),
                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                      'D' : np.array([3] * 4,dtype='int32'),
                      'E' : pd.Categorical(["test","train","test","train"]),
                      'F' : 'foo' })
df2

In [None]:
# Columns built from sequences need matching lengths; scalar values are broadcast to every row

In [None]:
s

In [None]:
df3 = pd.DataFrame({ 'A' : 1.,
                   'B' : pd.Timestamp('20180523'),
                   'C' : s,
                   'D' : [x**2 for x in range(7)],
                   'E' : pd.Categorical(['test','train']*3+["train"]),
                   'F' : 'aha'
                   })
df3

In [None]:
## different datatypes for columns!

In [None]:
df3.dtypes

In [None]:
df3[:5]

In [None]:
df3.head()

In [None]:
df3[-3:]

In [None]:
df3.tail(3)

In [None]:
df.index

In [None]:
df3.index

In [None]:
df3.values

In [None]:
df3.describe()

In [None]:
import seaborn as sb # graphics plotting library


In [None]:
sb.pairplot(df)

In [None]:
df

In [None]:
df3

In [None]:
sb.pairplot(df3.dropna(), hue='E')

In [None]:
# Transpose

In [None]:
df3.T

In [None]:
df.sort_index(axis=1,ascending=True)

In [None]:
## Sort by Axis in reverse

In [None]:
df.sort_index(axis=1,ascending=False)

In [None]:
df3.sort_values(by='C')

In [None]:
# Notice that NaN is sorted last

### Selection 

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc.
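
As a rough orientation before the cells that follow, this sketch shows the four accessors side by side on a tiny hypothetical frame (`frame` and its labels are made up for illustration).

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10, 20, 30]},
                     index=['x', 'y', 'z'])

row = frame.loc['y']            # one row by label, returned as a Series
block = frame.iloc[0:2, 0:1]    # rows and columns by position, returned as a DataFrame
cell_label = frame.at['y', 'B'] # one scalar by label
cell_pos = frame.iat[1, 1]      # the same scalar by position

print(row)
print(block)
print(cell_label, cell_pos)
```

`.at` and `.iat` only handle single cells, which is what lets them skip the extra work `.loc` and `.iloc` do for slices and lists.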

In [None]:
df3['D']

In [None]:
df3.D #same as above! Syntactic Sugar!

In [None]:
df3[:5]

In [None]:
df3[2:5]

In [None]:
df3[2:5:2]

In [None]:
df3[::-1]

## Selection by Label

For getting a cross section using a label:

In [None]:
df

In [None]:
df.loc[dates[0]]

In [None]:
df.loc[dates[2:5]]

In [None]:
## Selecting on a multi-axis by label:

In [None]:
df.loc[:, ['A','B','C']]

In [None]:
df.loc[dates[2:5], ['A','B','C']]

In [None]:
df.loc['20190601':'20190605',['B','C']]

In [None]:
# Reduction in the dimensions of the returned object:

In [None]:
df.loc['20190602', ["B", "D"]]

In [None]:
## Getting scalars (single values)

In [None]:
df.loc[dates[5], 'D']

In [None]:
# same as above

In [None]:
df.at[dates[5],'D']

In [None]:
## Selection by Position

In [None]:
df.iloc[3]

In [None]:
# By integer slices, acting similar to numpy/python:

In [None]:
df.iloc[2:5,:2]

In [None]:
# By lists of integer position locations, similar to the numpy/python style:

In [None]:
df.iloc[[3,5,1],[1,4,2]]

In [None]:
df.iloc[2,2]

In [None]:
# For getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.iat[2,2]

## Boolean Indexing

In [None]:
## Using a single column’s values to select data.

In [None]:
df[df.A > 0.2]

In [None]:
df[df > 0]

In [None]:
df[df > 1]

In [None]:
s1 = pd.Series([x**3 for x in range(15)], index=pd.date_range('20130521', periods=15))
s1

In [None]:
df['F'] = s1
df

In [None]:
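## F is all NaN because assignment aligns on the index, and s1's 2013 dates match none of df's labels.
## This can look like a bug; see https://github.com/pandas-dev/pandas/issues/10440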
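
A small sketch of what likely happens with column `F` above: assigning a Series to a DataFrame column matches values by index label, not by position. The frame and Series here are hypothetical stand-ins with non-overlapping date indexes. Passing a plain array via `to_numpy()` is one way to assign by position instead.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]},
                     index=pd.date_range('20190528', periods=3))
other = pd.Series([10, 20, 30],
                  index=pd.date_range('20130521', periods=3))

# Label-based alignment: no dates in common, so every value is NaN
frame['aligned'] = other

# A plain NumPy array has no index, so values are placed by position
frame['positional'] = other.to_numpy()

print(frame)
```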

In [None]:
df['F']=42
df

In [None]:
df['G']=[x**3 for x in range(15)] # passing a fresh list to particular column
df

In [None]:
s1

In [None]:
s1+2

In [None]:
s1/3

In [None]:
df.at[dates[1], 'A'] = 33
df

In [None]:
df.iat[4,4]= 42
df

In [None]:
df3 = df.copy()

In [None]:
df3[df3 > 0.2 ] = -df3
df3

In [None]:
# Missing Data
# pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations
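
To illustrate "not included in computations", here is a minimal sketch on a three-element Series (`values` is an illustration name). Reductions such as `mean` skip NaN by default, and `skipna=False` turns that off.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

default_mean = values.mean()             # NaN is skipped: (1.0 + 3.0) / 2
strict_mean = values.mean(skipna=False)  # NaN propagates into the result
non_missing = values.count()             # counts only non-missing entries

print(default_mean, strict_mean, non_missing)
```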

In [None]:
df['H'] = s1
df

In [None]:
df.fillna(value=3.14)

In [None]:
# There is also df.dropna(), which drops every ROW that contains missing data
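
A short sketch of `dropna` on a hypothetical two-column frame. By default it removes rows; `axis=1` removes columns instead, and `how='all'` only drops rows where every value is missing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [4.0, 5.0, 6.0]})

rows_dropped = frame.dropna()           # drops the row containing NaN
cols_dropped = frame.dropna(axis=1)     # drops column 'A' instead
all_missing = frame.dropna(how='all')   # no row is entirely NaN, so nothing is dropped

print(rows_dropped.shape, cols_dropped.shape, all_missing.shape)
```

None of these calls change `frame` itself; each returns a new DataFrame.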

## Operations

In [None]:
df.mean()

In [None]:
# Other axis

In [None]:
df.mean(axis=1)

In [None]:
## Apply

In [None]:
df.apply(lambda x: x*3) # ie same as df*3
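
The lambda above receives one whole column at a time, which is why it matches `df*3`. This sketch, on a small hypothetical frame, shows that equivalence plus two cases where `apply` does something plain arithmetic cannot express as directly: a per-column reduction and a per-row function via `axis=1`.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Column-wise (default): the function is called once per column
tripled = frame.apply(lambda col: col * 3)

# A reducing function gives one value per column
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1 calls the function once per row instead
row_sum = frame.apply(lambda row: row.sum(), axis=1)

print(tripled.equals(frame * 3))
print(col_range)
print(row_sum)
```

For simple arithmetic like `frame * 3`, the vectorized form is generally preferable; `apply` is more useful when the function needs a whole row or column.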

In [None]:
ts = pd.Series(np.random.randn(3650), index=pd.date_range('11/18/2008', periods=3650))

In [None]:
ts = ts.cumsum() # cumulative sum

In [None]:
ts.plot()

In [None]:
# CSV
# Writing to a csv file.

In [None]:
df.to_csv("testing.csv")

In [None]:
# Reading from csv


In [None]:
df5= pd.read_csv('resources/random4x9.csv')
df5

In [None]:
# Excel

In [None]:
df.to_excel('myx.xlsx', sheet_name='Sheet1')


In [None]:
df6 = pd.read_excel('myx.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])

In [None]:
df6

In [None]:
df.info()

In [None]:
df.info(memory_usage="deep") # deep introspection reports the real memory used by object columns