![image.png](attachment:image.png)

https://pandas.pydata.org/

pandas builds its data-manipulation capabilities on top of NumPy, making use of NumPy's fast array processing, and builds its plotting capabilities on top of Matplotlib.

* "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

* It may be one of the most widely used tools for data munging
  * present data in nice formats
  * multiple convenient methods for filtering data
  * work with a variety of data formats (CSV, Excel, …)
  * convenient functions for quickly plotting data

* The name comes from "panel data" and is also a play on "Python data analysis"

In [None]:
import pandas as pd
import numpy as np

In [None]:
titles = pd.Series(['And Now for Something Completely Different',
          'Monty Python and the Holy Grail',
          'Monty Python\'s Life of Brian',
          'Monty Python Live at the Hollywood Bowl',
          'Monty Python\'s The Meaning of Life',
          'Monty Python Live (Mostly)'])

In [None]:
titles

In [None]:
titles[0:2]

In [None]:
year = [1971,1975,1979,1982,1983,2014]

In [None]:
production_budget = ['100000', '400000', '4000000', None, '9000000', None]

In [None]:
# taken from https://www.the-numbers.com; the figures have not been independently verified
box_office = [np.nan, 5028948, 20515486, 327958, 14929552, 2215461]

In [None]:
df = pd.DataFrame({'Year': year,
                   'Titles': titles,
                   'Budget': production_budget,
                   'Gross': box_office})
df

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.dtypes

In [None]:
df['Budget']

## Basic info

In [None]:
df.shape

In [None]:
df.info()

## We will come back to nulls and Dtypes

In [None]:
df.describe()

In [None]:
df.T

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.sort_index(axis=1, ascending=False)

In [None]:
df.sort_index(axis=0, ascending=False)

In [None]:
df.sort_values(by='Gross')

In [None]:
df.sort_values(by='Gross',ascending=False)
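
Sorting can also use several keys and control where missing values end up. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative and are not part of the data above.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not the Monty Python table
toy = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                    'value': [2.0, np.nan, 1.0, 3.0]})

# sort by group first, then by value within each group;
# na_position='first' places missing values ahead of the others
sorted_toy = toy.sort_values(by=['group', 'value'], na_position='first')
sorted_toy
```

By default `na_position` is `'last'`, which is why the film with a missing `Gross` ends up at the bottom in both sort directions.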

# Selecting
loc, iloc, at, iat

In [None]:
df['Titles']

In [None]:
df[0:3]

In [None]:
df.loc[1]

In [None]:
df.loc[1,['Titles','Year']]

In [None]:
df.loc[0:2,['Titles','Year']]

In [None]:
df.loc[1,'Titles']

In [None]:
df.at[1,'Titles']

In [None]:
df.iloc[1]

In [None]:
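# iloc is position-based: df.iloc[1,'Titles'] raises an error, so look up the column position first
df.iloc[1, df.columns.get_loc('Titles')]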

In [None]:
df.iloc[1,1]

In [None]:
df.iloc[0:2,:]

In [None]:
df.iat[1,1]
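
With the default index (0, 1, 2, ...), labels and positions coincide, so `loc` and `iloc` can look interchangeable. The sketch below uses a hypothetical frame called `films` with a non-default index to show where they differ.

```python
import pandas as pd

# hypothetical frame whose index labels are not 0, 1, 2, ...
films = pd.DataFrame({'Titles': ['Holy Grail', 'Life of Brian', 'Meaning of Life'],
                      'Year': [1975, 1979, 1983]},
                     index=[10, 20, 30])

by_label = films.loc[20, 'Titles']    # row with the label 20
by_position = films.iloc[1, 0]        # second row, first column
label_slice = films.loc[10:20]        # label slices include the end label
position_slice = films.iloc[0:1]      # position slices exclude the end position
```

Here `by_label` and `by_position` pick the same cell, while the two slices return two rows and one row respectively.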

# Boolean indexing

In [None]:
df['Budget'] > 1000000

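Something's not right: `Budget` was built from strings, so it cannot be compared meaningfully with a number. Check the dtypes.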

In [None]:
df.dtypes

In [None]:
df['Budget'].astype(float)

In [None]:
df['Budget'] > 1000000

In [None]:
df.dtypes

In [None]:
df['Budget'] = df['Budget'].astype(float)

In [None]:
df.dtypes
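
`astype(float)` works here because every non-missing entry is a clean numeric string. When a column might also contain text that is not a number, `pd.to_numeric` with `errors='coerce'` is one alternative. This is a sketch on a made-up series named `raw`, not a step the notebook itself takes.

```python
import pandas as pd

# hypothetical raw values, including one entry that is not a number
raw = pd.Series(['100000', '4000000', None, 'unknown'])

# entries that cannot be parsed become NaN instead of raising an error
budget = pd.to_numeric(raw, errors='coerce')
budget
```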

In [None]:
df['Budget'] > 1000000

In [None]:
df[df['Budget'] > 1000000]
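
Boolean masks can be combined to filter on more than one condition. A minimal sketch, using a hypothetical numeric frame `toy` so that it runs on its own:

```python
import pandas as pd

# hypothetical numeric frame for illustration
toy = pd.DataFrame({'Budget': [100000.0, 400000.0, 4000000.0, 9000000.0],
                    'Gross': [50000.0, 5000000.0, 20000000.0, 15000000.0]})

# each condition needs its own parentheses; use & (and), | (or), ~ (not)
big_and_profitable = toy[(toy['Budget'] > 1000000) & (toy['Gross'] > toy['Budget'])]
small_or_loss = toy[(toy['Budget'] < 200000) | (toy['Gross'] < toy['Budget'])]
```

Python's `and` and `or` do not work element-wise on a Series, which is why the operators `&` and `|` are used instead.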

# Missing data

In [None]:
df.isna()

In [None]:
df.isnull()

In [None]:
df[df['Budget'].isna()]

In [None]:
df.loc[df['Budget'].isna()].info()

In [None]:
# might want to do this
# df[df['Budget'].isna()] = 0
# but no!
# that will set entire rows to 0

In [None]:
# df[df['Budget'].isna()] = 0

In [None]:
# df[df['Budget'].isna()]
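
The difference between zeroing whole rows and zeroing one column can be seen on a small copy. The frame `toy` below is hypothetical and kept numeric so the sketch behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame
toy = pd.DataFrame({'Year': [1975, 1979, 1983],
                    'Budget': [1.0, np.nan, 3.0]})

mask = toy['Budget'].isna()

# assigning through a boolean row mask overwrites every column in those rows
whole_rows = toy.copy()
whole_rows[mask] = 0

# naming the column with .loc changes only that column
one_column = toy.copy()
one_column.loc[mask, 'Budget'] = 0
```

In `whole_rows` the year of the second film is lost as well, while `one_column` keeps it and only fills the missing budget.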

In [None]:
df.iloc[3]

In [None]:
df['Budget'].fillna(value=0)

In [None]:
df.iloc[3]

In [None]:
# assign the result back; inplace=True on a selected column is deprecated and may not update df
df['Budget'] = df['Budget'].fillna(value=0)

In [None]:
df

In [None]:
df['Gross']

In [None]:
df['Gross'].isna()

In [None]:
# preview of aggregate calculations
df['Gross'].fillna(value=df['Gross'].mean())

In [None]:
df['Gross'] = df['Gross'].fillna(value=df['Gross'].mean())

In [None]:
df

In [None]:
df.info()

In [None]:
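# without a format, integer years are read as nanoseconds since 1970-01-01, which is not what we want
pd.to_datetime(df['Year'])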

In [None]:
df['Year'] = pd.to_datetime(df['Year'],format='%Y')

In [None]:
df['Year']

In [None]:
df.dtypes
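
Once a column has a datetime dtype, the `.dt` accessor exposes its components. A short sketch on a hypothetical series of years:

```python
import pandas as pd

# hypothetical release years
years = pd.Series([1975, 1979, 1983])
dates = pd.to_datetime(years, format='%Y')   # each becomes 1 January of that year

release_year = dates.dt.year                 # back to plain integers
decade = (dates.dt.year // 10) * 10          # e.g. 1975 -> 1970
```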

In [None]:
df.loc[df['Gross']/df['Budget'] > 2]

In [None]:
df['Profit Factor'] = df['Gross']/df['Budget']

In [None]:
df

In [None]:
df.iloc[4]
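
Because the missing budgets were filled with 0, the ratio for those films is a division by zero, which pandas reports as `inf`. One possible way to handle this, shown on hypothetical numbers, is to turn those results back into missing values:

```python
import numpy as np
import pandas as pd

# hypothetical numbers; a budget of 0 stands in for "unknown"
toy = pd.DataFrame({'Budget': [400000.0, 0.0, 9000000.0],
                    'Gross': [5000000.0, 300000.0, 15000000.0]})

ratio = toy['Gross'] / toy['Budget']                 # the zero budget gives inf
cleaned = ratio.replace([np.inf, -np.inf], np.nan)   # treat those as missing
```

Whether 0, `NaN`, or some estimate is the right stand-in for an unknown budget depends on the analysis; this sketch only shows the mechanics.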

In [None]:
df.plot(x='Year',y='Gross')

In [None]:
df.plot(x='Year',y='Gross',kind='scatter')

# Calculating values and aggregating

In [None]:
df['Budget'].count()

In [None]:
df.count()

In [None]:
df['Budget'].mean()

In [None]:
df.groupby('Budget')['Gross'].mean()

In [None]:
df4group = pd.DataFrame({'folks':['a','b','c','d','e','f'],
                         'python skills':[1,2,1,3,3,1],
                         'monty python knowledge':[3,1,1,2,2,3]})

In [None]:
df4group

In [None]:
df4group.groupby('python skills').count()

In [None]:
df4group.groupby('python skills')['monty python knowledge'].mean()
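
Several aggregates can be computed in one pass with `agg`. The sketch below rebuilds the same data under the hypothetical name `skills` so that it runs on its own.

```python
import pandas as pd

# same data as df4group, rebuilt here so the cell is self-contained
skills = pd.DataFrame({'folks': ['a', 'b', 'c', 'd', 'e', 'f'],
                       'python skills': [1, 2, 1, 3, 3, 1],
                       'monty python knowledge': [3, 1, 1, 2, 2, 3]})

# one row per skill level, one column per aggregate
summary = skills.groupby('python skills')['monty python knowledge'].agg(['count', 'mean', 'max'])
summary
```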

# Final fun - estimating $\pi$ (again)

![image.png](attachment:image.png)

For points sampled uniformly from a square of side $2r$, the expected fraction that land inside the inscribed circle of radius $r$ is the ratio of the two areas:

$$\frac{N_{inside}}{N_{total}} = \frac{\pi r^2}{4 r^2}$$

so we can use our sample to calculate $\pi$ via:

$$\pi = 4 \frac{N_{inside}}{N_{total}}$$

In [None]:
np.random.uniform(0,1)

In [None]:
pi_sample = pd.DataFrame(columns=['x','y','in_circle'])

In [None]:
x = np.random.uniform(0,1,1000)
y = np.random.uniform(0,1,1000)
in_circle = (((x-0.5)**2 + (y-0.5)**2) < 0.5**2)

In [None]:
pi_sample['x'] = x
pi_sample['y'] = y
pi_sample['in_circle'] = in_circle

In [None]:
pi_sample

In [None]:
pi_sample[pi_sample['in_circle']]

In [None]:
pi_sample.groupby('in_circle').count()

In [None]:
d = pi_sample.groupby('in_circle')['x'].count()

In [None]:
d[True]

In [None]:
pi_sample.count()

In [None]:
pi_sample.groupby('in_circle')['x'].count()[True] / pi_sample.count()['in_circle'] * 4

In [None]:
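def pi_estimate(nums = 1000):
    """Print a Monte Carlo estimate of pi from nums random points and return the sample."""
    df = pd.DataFrame({'x': np.random.uniform(0,1,nums),
                       'y': np.random.uniform(0,1,nums)})
    # a point is inside when its distance from the centre (0.5, 0.5) is below the radius 0.5
    df['in_circle'] = (((df['x']-0.5)**2 + (df['y']-0.5)**2) < 0.5**2)
    # summing booleans counts the True values, and still works if no point is inside
    circle_count = df['in_circle'].sum()
    total_count = nums
    estimated_pi = 4 * circle_count / total_count
    print('pi = '+str(estimated_pi))
    return df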

In [None]:
testpi = pi_estimate(100)

In [None]:
testpi
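
The estimate gets better as the number of points grows. The sketch below uses a hypothetical helper `estimate_pi` (not the notebook's function) with NumPy's seeded generator so that repeated runs give the same numbers.

```python
import numpy as np

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n uniform points in the unit square."""
    rng = np.random.default_rng(seed)   # seeded so repeated runs give the same result
    x = rng.uniform(0, 1, n)
    y = rng.uniform(0, 1, n)
    inside = (x - 0.5)**2 + (y - 0.5)**2 < 0.5**2
    return 4 * inside.mean()            # mean of booleans = fraction inside

# estimates for increasing sample sizes
estimates = {n: estimate_pi(n) for n in (100, 10_000, 1_000_000)}
estimates
```

Taking the mean of the boolean array gives the fraction of points inside the circle directly, which avoids the `groupby` and `count` steps.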

# References
* https://pandas.pydata.org/pandas-docs/stable/user_guide/
* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html