## Essential pandas functionality



In [1]:
import pandas as pd
import numpy as np

### Function application and mapping



In [6]:
frame = pd.DataFrame(-np.arange(12).reshape((4,3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

Let's apply a NumPy *ufunc* (an element-wise array function) to the whole DataFrame:



In [7]:
np.abs(frame)

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


Write a lambda function and apply it with `frame.apply(f)`



In [9]:
# apply passes each column to the function as a Series
f = lambda x: x + 10
frame.apply(f)

         b  d   e
Utah    10  9   8
Ohio     7  6   5
Texas    4  3   2
Oregon   1  0  -1
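The function passed to `apply` does not have to return a Series of the same shape. As a minimal sketch (the helper name `spread` is only illustrative), a function that reduces each column or row to a single number gives back a Series:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(-np.arange(12).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Hypothetical helper: difference between the largest and smallest value
def spread(x):
    return x.max() - x.min()

col_spread = frame.apply(spread)                  # one value per column
row_spread = frame.apply(spread, axis='columns')  # one value per row
print(col_spread)
print(row_spread)
```

With `axis='columns'` the function is called once per row instead of once per column.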


#### `DataFrame.applymap`



In [11]:
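# Named fmt to avoid shadowing the built-in format().
# Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
fmt = lambda x: '%.2f' % x
frame.applymap(fmt)

            b       d       e
Utah     0.00   -1.00   -2.00
Ohio    -3.00   -4.00   -5.00
Texas   -6.00   -7.00   -8.00
Oregon  -9.00  -10.00  -11.00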


### Sorting



In [14]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj
obj.sort_index()
obj.sort_values()

d    0
a    1
b    2
c    3
dtype: int64
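Only the last expression of the cell is displayed, so the two sort methods are hard to tell apart here. A small sketch that keeps both results, and also sorts in descending order with the `ascending` parameter:

```python
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

by_label = obj.sort_index()                       # labels in order a, b, c, d
by_value_desc = obj.sort_values(ascending=False)  # values in order 3, 2, 1, 0
print(by_label)
print(by_value_desc)
```

Both methods return a new Series and leave `obj` unchanged.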

#### Sorting DataFrames



In [15]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1


In [16]:
frame.sort_values(by='b')

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1


In [19]:
print(frame)
print(frame.sort_values(by='b'))
frame.sort_values(by=['a', 'b'])

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1


   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
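When sorting by several columns, `ascending` can also be a list with one flag per key. A short sketch on the same frame, sorting `a` ascending and breaking ties with `b` descending:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# One ascending flag per sort key, in the same order as `by`
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```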


## Descriptive statistics



A number of common mathematical and statistical methods are available for Series and DataFrames. Most of them are reductions that skip missing values (`NaN`) by default.



In [21]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

In [23]:
df.mean()

one    3.083333
two   -2.900000
dtype: float64

In [24]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [25]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


### Descriptive and summary statistics



![img](images/pdsummary.png)
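A few more of these methods, sketched on the same small DataFrame used above. This is only a sample of what is available, not a complete tour:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

counts = df.count()    # number of non-NaN values per column
totals = df.sum()      # column sums, NaN skipped
where_max = df.idxmax()  # index label of the maximum in each column
running = df.cumsum()  # cumulative sums; NaN positions stay NaN
print(counts, totals, where_max, running, sep='\n\n')
```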



## Exercise



Let's grab some data on stock prices and volumes from Yahoo! Finance. Downloading it requires the `pandas-datareader` package:

    conda install pandas-datareader



In [27]:
#import pandas_datareader.data as web
#all_data = {ticker: web.get_data_yahoo(ticker)
#            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
import pickle
# Load the data from a local pickle file; the context manager closes the file
with open('all_data.pkl', 'rb') as f:
    all_data = pickle.load(f)
all_data

{'AAPL':                   High         Low        Open       Close       Volume  \
 Date                                                                      
 2010-01-04   30.642857   30.340000   30.490000   30.572857  123432400.0   
 2010-01-05   30.798571   30.464285   30.657143   30.625713  150476200.0   
 2010-01-06   30.747143   30.107143   30.625713   30.138571  138040000.0   
 2010-01-07   30.285715   29.864286   30.250000   30.082857  119282800.0   
 2010-01-08   30.285715   29.865715   30.042856   30.282858  111902700.0   
 2010-01-11   30.428572   29.778572   30.400000   30.015715  115557400.0   
 2010-01-12   29.967142   29.488571   29.884285   29.674286  148614900.0   
 2010-01-13   30.132856   29.157143   29.695715   30.092857  151473000.0   
 2010-01-14   30.065714   29.860001   30.015715   29.918571  108223500.0   
 2010-01-15   30.228571   29.410000   30.132856   29.418571  148516900.0   
 2010-01-19   30.741428   29.605715   29.761429   30.719999  182501900.0   
 201

1.  Create two **DataFrames** for the price ('Adj Close') and volume ('Volume')
2.  Compute the percent change (i.e. the returns) of the Google stock price
3.  Use a one-liner to calculate the average of the total value of stocks traded during the period downloaded
4.  Calculate the correlation (`corr` and `corrwith`) of the returns of the IBM and Microsoft stocks
5.  Plot the volume of the IBM trades executed



In [63]:
price = pd.DataFrame({ticker: all_data[ticker]['Adj Close'] for ticker in all_data})
volume = pd.DataFrame({ticker: all_data[ticker]['Volume'] for ticker in all_data})
returns = price.pct_change()
returns[['IBM', 'MSFT']].corr()

          IBM     MSFT
IBM   1.00000  0.48655
MSFT  0.48655  1.00000
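The cell above uses `DataFrame.corr`, while the exercise also mentions `corrwith`. The sketch below contrasts the two on synthetic random-walk prices. The numbers are made up and only the ticker names are borrowed from the exercise, so the correlations say nothing about the real stocks:

```python
import numpy as np
import pandas as pd

# Synthetic prices (assumption: random walks starting near 100)
rng = np.random.default_rng(0)
prices = pd.DataFrame(100 + rng.normal(size=(50, 3)).cumsum(axis=0),
                      columns=['IBM', 'MSFT', 'GOOG'])
returns = prices.pct_change()

pair = returns['MSFT'].corr(returns['IBM'])     # a single number
against_ibm = returns.corrwith(returns['IBM'])  # one value per column
print(pair)
print(against_ibm)
```

`Series.corr` compares two Series, whereas `DataFrame.corrwith` correlates every column of the DataFrame with the given Series.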


## Reading and writing data in text format



Although `read_csv` and `read_table` are the most commonly used, a number of other parsing functions are available:

![img](images/parsing1.png)

![img](images/parsing2.png)



Optional arguments for these functions fall into categories:

-   Indexing
-   Type inference and data conversion
-   Datetime parsing
-   Iterating
-   Unclean data
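To make these categories concrete, here is a minimal sketch that uses one `read_csv` argument from each of them. The inline data and its column names are invented for illustration and are read from a string buffer instead of a file:

```python
import io
import pandas as pd

# Made-up data; 'missing' is a hypothetical sentinel for absent readings
raw = io.StringIO(
    "date,city,temp\n"
    "2018-01-01,Boston,-3.5\n"
    "2018-01-02,Boston,missing\n"
    "2018-01-03,Boston,1.0\n"
)

df = pd.read_csv(
    raw,
    index_col='date',          # indexing: use this column as the row index
    parse_dates=True,          # datetime parsing: parse the index as dates
    dtype={'temp': 'float64'},  # type inference and data conversion
    na_values=['missing'],     # unclean data: extra strings treated as NaN
    nrows=2,                   # iterating: read only the first two rows
)
print(df)
```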

## CSV files



In [None]:
!cat examples/ex1.csv

In [None]:
df = pd.read_csv('examples/ex1.csv')
df

In [None]:
pd.read_table('examples/ex1.csv', sep=',')

### What if there's no header row?



In [None]:
pd.read_csv('examples/ex2.csv', header=None)

In [None]:
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

In [None]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

### We can also form a hierarchical index from multiple columns



In [None]:
!cat examples/csv_mindex.csv

In [None]:
parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1', 'key2'])
parsed
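Once the index is hierarchical, rows can be selected by the outer key alone or by a full key tuple. A small sketch, assuming a file laid out like the stand-in below (the actual contents of `examples/csv_mindex.csv` may differ):

```python
import io
import pandas as pd

# Stand-in data with two key columns; values are illustrative only
raw = io.StringIO(
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
)
parsed = pd.read_csv(raw, index_col=['key1', 'key2'])

outer = parsed.loc['one']                   # all rows with key1 == 'one'
cell = parsed.loc[('two', 'a'), 'value2']   # a single value by full key
print(outer)
print(cell)
```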

### What if the delimiter is non-standard?



In [None]:
!cat examples/ex3.txt

In [None]:
# Raw string so that \s is passed to the regex engine unchanged
result = pd.read_table('examples/ex3.txt', sep=r'\s+')
result
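The same idea can be tried without a file. In this sketch the text is invented and the fields are separated by a variable amount of whitespace. Because the header has one fewer field than the data rows, pandas uses the first column as the index:

```python
import io
import pandas as pd

# Made-up, whitespace-aligned text: 3 header fields, 4 fields per data row
raw = io.StringIO(
    "       A     B     C\n"
    "aaa  1.5  -2.0  0.25\n"
    "bbb  3.0   4.5  -1.0\n"
)
result = pd.read_csv(raw, sep=r'\s+')
print(result)
```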

### Missing values



In [None]:
!cat examples/ex4.csv

In [None]:
result = pd.read_csv('examples/ex4.csv')
result

The `na_values` option accepts a list or set of strings to treat as missing values, or a dict that maps column names to their own sentinels.



In [None]:
result = pd.read_csv('examples/ex4.csv', na_values=['NULL'])
result

In [None]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('examples/ex4.csv', na_values=sentinels)
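A self-contained sketch of the per-column sentinels, using invented data in place of `examples/ex4.csv` (whose real contents may differ). Each sentinel only applies in the column it is listed under:

```python
import io
import pandas as pd

# Stand-in data; 'two' also appears as a value we want to blank out
raw = io.StringIO(
    "something,a,b,message\n"
    "one,1,2,NA\n"
    "two,5,6,world\n"
    "three,9,10,foo\n"
)
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(raw, na_values=sentinels)
missing = result.isna()
print(result)
print(missing)
```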

### Some read_csv/read_table function arguments



![img](images/parsing3.png)

![img](images/parsing4.png)



## Write data to text format



In [None]:
data = pd.read_csv('examples/ex4.csv')
data

We can use the `to_csv` method to write the DataFrame to a comma-separated file:



In [None]:
data.to_csv('out.csv')

Look at the function's documentation and write the file

-   using a different delimiter
-   using the word MISSING for missing data
-   with no header
-   with only the first two columns
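One possible starting point for these variations, sketched on a small invented DataFrame and written to in-memory buffers so no files are created. The parameters `sep`, `na_rep`, `header`, `index` and `columns` are all documented arguments of `to_csv`:

```python
import io
import numpy as np
import pandas as pd

# Invented data with one missing value
data = pd.DataFrame({'something': ['one', 'two'],
                     'a': [1, 5],
                     'message': [np.nan, 'world']})

# Different delimiter and a marker for missing data
buf = io.StringIO()
data.to_csv(buf, sep='|', na_rep='MISSING', index=False)
piped = buf.getvalue()

# No header row, only the first two columns
buf = io.StringIO()
data.to_csv(buf, header=False, index=False, columns=['something', 'a'])
subset = buf.getvalue()

print(piped)
print(subset)
```

Passing a file path instead of the buffer writes the same text to disk.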



## JSON data



JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web browsers and other applications.



In [1]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [5]:
import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [6]:
json.dumps(result)

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

### How do we convert from JSON to a DataFrame?



In [12]:
import pandas as pd
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
# Dump the parsed list of records, not the raw string `obj`: dumping `obj`
# writes a double-encoded JSON string instead of a table-like structure.
with open('example.json', 'w') as f:
    json.dump(result['siblings'], f)

In [13]:
!cat example.json

[{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]

In [None]:
data = pd.read_json('example.json')
data

In [None]:
print(data.to_json())
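The layout that `to_json` produces depends on its `orient` parameter. A short sketch with made-up records, read from a string buffer instead of `example.json`:

```python
import io
import json
import pandas as pd

# Made-up records; each JSON object becomes one row
records = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]'
data = pd.read_json(io.StringIO(records))

as_columns = data.to_json()                  # default: {column: {index: value}}
as_records = data.to_json(orient='records')  # list of row objects
print(as_columns)
print(as_records)
```

The `records` orientation round-trips to the same list of objects that was read in, which is often the easier form to exchange with other applications.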

## Homework



1.  Install the `lxml`, `beautifulsoup4` and `html5lib` modules and familiarize yourself with their functionality.
2.  The EPA uses RadNet to monitor radiation in the air across the country. Download the data for Worcester, MA in 2018 from [https://www.epa.gov/radnet/radnet-csv-file-downloads](https://www.epa.gov/radnet/radnet-csv-file-downloads).
    -   What was the maximum dose equivalent rate for 2018?
    -   What was the average dose equivalent rate for the 20 largest gamma count rates? What is the standard deviation for the different channels?
    -   Which gamma channel would be the best predictor for dose equivalent rate?
    -   If 90 nSv/h were considered the safe threshold, how many days in 2018 were not safe?
    -   Find the average and variance of the gamma count rates, when the dose rate was between 82 and 88 nSv/h.
    -   Write a CSV file that has a column designating the dose rate as safe or unsafe.
    -   Use a one-liner to calculate the range of values for each gamma count channel.

