# Case Study - Counting pandas

## How to count all the functions, methods and attributes that pandas has to offer?
There are multiple ways to do this, but for this exercise we will start by assuming that the [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html) in the pandas docs contains all the functionality of pandas. Full URL: http://pandas.pydata.org/pandas-docs/stable/api.html

Wow, that's an absurd amount of functionality for one library. Manually counting this might take some time. Let's use pandas to help us out.

In [1]:
import pandas as pd

### Finding pages with HTML tables

It is often not obvious that a web page is built from HTML tables. For example, the pandas API reference page does not appear to contain what you would normally call a 'table'. However, all modern browsers can show the HTML behind the current page. In Chrome, right-click and choose **Inspect** or **View page source**. Choosing Inspect jumps directly to the HTML for the element you clicked.

While inspecting the HTML, you can use the search function to find HTML tables, which are always written with **`<table>`** elements.

Go ahead and inspect the API page and see if the underlying elements are indeed HTML tables.
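
The same check can be done in code instead of by eye. The following is a minimal sketch using only the standard library; `TableCounter` and `sample_html` are hypothetical names, and the HTML string is a small stand-in rather than the real API page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count opening <table> tags in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "table":
            self.table_count += 1


# Stand-in for a downloaded page: one paragraph and two tables
sample_html = """
<html><body>
  <p>Not a table</p>
  <table><tr><td>read_pickle</td><td>Load pickled object</td></tr></table>
  <table><tr><td>Series.index</td><td>The index of the Series</td></tr></table>
</body></html>
"""

counter = TableCounter()
counter.feed(sample_html)
print(counter.table_count)
```

A non-zero count suggests the page is a reasonable candidate for table scraping.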

### `read_html` to scrape tables

Pandas has a handy-dandy function **`read_html`** which reads all the HTML tables from the given URL. It returns a list of pandas DataFrame objects, one for each table found. Let's use this now to grab every single table on that page.

In [2]:
# grab all html tables from api reference page
api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html')

In [3]:
# how many tables are there
len(api_tables)

157

In [4]:
# let's look at a few tables
api_tables[0]

Unnamed: 0,0,1
0,"read_pickle(path[, compression])",Load pickled pandas object (or any object) fro...


In [5]:
# take a look at another table
api_tables[44]

Unnamed: 0,0,1
0,Categorical.dtype,The CategoricalDtype for this instance
1,Categorical.categories,The categories of this categorical.
2,Categorical.ordered,Whether the categories have an ordered relatio...
3,Categorical.codes,The category codes of this categorical.


It looks like they are all two-column tables with the attribute in the first column and the description in the second. Everything looks good. Let's try counting.

In [6]:
count = 0
for table in api_tables:
    count += len(table)
print(f"There are {count} things pandas can do!")

There are 1331 things pandas can do!
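
The total above rests on two assumptions: every table has the same two-column layout, and no name is listed in more than one table. A minimal sketch of how one might check both, using small hand-built frames as stand-ins for the scraped tables (`sample_tables` and the repeated `Series.index` row are illustrative, not taken from the real page):

```python
import pandas as pd

# Hypothetical stand-ins for scraped tables; columns are labeled 0 and 1
sample_tables = [
    pd.DataFrame({0: ["Series.index"], 1: ["The index (axis labels) of the Series."]}),
    pd.DataFrame({0: ["Series.values", "Series.dtype"], 1: ["Return Series as ndarray", "return the dtype object"]}),
    pd.DataFrame({0: ["Series.index"], 1: ["Listed a second time for illustration"]}),
]

# True only if every table has exactly two columns
all_two_columns = all(table.shape[1] == 2 for table in sample_tables)

# Stack the name column of every table into one Series
names = pd.concat([table[0] for table in sample_tables], ignore_index=True)

total_rows = len(names)          # what the simple loop counts
unique_names = names.nunique()   # the count after removing repeated names

print(all_two_columns, total_rows, unique_names)
```

If the two counts differ on the real data, the simple row total overstates the number of distinct functions, methods and attributes.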


## How much functionality does the pandas Series have?
If we want to count just the Series functionality, we can search the tables for the text `Series.`. The `read_html` function provides a `match` parameter that keeps only the tables containing text that matches a regular expression. We pass it the regular expression `Series[.]`, which matches the literal text `Series.`.

In [7]:
series_api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html', match='Series[.]')
print(f"There are {len(series_api_tables)} Series tables")

There are 26 Series tables
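
To see why the pattern is written as `Series[.]` rather than `Series.`, here is a small sketch with Python's `re` module. The strings in `sample_names` are stand-ins chosen for illustration; the point is that an unescaped `.` matches any character.

```python
import re

# Illustrative names; only the first one actually contains the text "Series."
sample_names = ["Series.index", "SeriesGroupBy.nlargest", "DataFrame.index"]

# An unescaped dot matches any character, so "SeriesG" also matches
loose = [name for name in sample_names if re.search("Series.", name)]

# Inside a character class the dot is literal
strict = [name for name in sample_names if re.search("Series[.]", name)]

print(loose)
print(strict)

# re.escape builds an equivalent literal pattern
print(re.escape("Series."))
```

Note that the pattern is searched for anywhere in the text, so a name that merely contains `Series.` somewhere in the middle would match as well.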


Let's inspect a couple of them and verify the results.

In [8]:
series_api_tables[0]

Unnamed: 0,0,1
0,Series.index,The index (axis labels) of the Series.


In [9]:
series_api_tables[1]

Unnamed: 0,0,1
0,Series.values,Return Series as ndarray or ndarray-like depen...
1,Series.dtype,return the dtype object of the underlying data
2,Series.ftype,return if the data is sparse|dense
3,Series.shape,return a tuple of the shape of the underlying ...
4,Series.nbytes,return the number of bytes in the underlying data
5,Series.ndim,return the number of dimensions of the underly...
6,Series.size,return the number of elements in the underlyin...
7,Series.strides,return the strides of the underlying data
8,Series.itemsize,return the size of the dtype of the item of th...
9,Series.base,return the base object if the memory of the un...


In [10]:
count_series = sum([len(table) for table in series_api_tables])
print(f"There are {count_series} things pandas Series can do!")

There are 356 things pandas Series can do!


# Exercises

## Problem 1
<span  style="color:green; font-size:16px">Rewriting the counting code every time we want to count the functionality of a different pandas object is cumbersome. Write a function that accepts the parameter **obj**, the name of a pandas object as a string, and reports how many functions, methods and attributes the pandas API lists for it. Try it on a few objects such as DataFrame or MultiIndex.</span>

In [11]:
def count_functionality(obj):
    pass

## Problem 2
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to the [RealClearPolitics 2016 Clinton vs. Trump poll page][1]
1. Use pandas `read_html` to read in the full table at the **bottom** of the page and display it here in the notebook
1. Use the `header` parameter to find the correct header instead of the default numbers
1. Inspect the info to make sure the Clinton and Trump data types are float64
1. Add a column that calculates the difference between Clinton and Trump

[1]: http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html

# Solutions

## Problem 1
<span  style="color:green; font-size:16px">Write a function that accepts the parameter **obj**, the name of a pandas object as a string, and reports its functionality count. Try it on a few objects such as DataFrame or MultiIndex.</span>

In [12]:
def count_functionality(obj):
    # keep only the tables that contain text like "<obj>."
    api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html', match=obj + '[.]')
    # each table row is one function, method or attribute
    count = sum(len(table) for table in api_tables)
    print(f"There are {count} things pandas {obj} can do!")

In [13]:
count_functionality('Series')
count_functionality('DataFrame')
count_functionality('MultiIndex')

There are 356 things pandas Series can do!
There are 264 things pandas DataFrame can do!
There are 24 things pandas MultiIndex can do!
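
The solution above downloads the page again on every call. As an alternative sketch, one could scrape once and filter the already-loaded tables locally. `count_functionality_local` is a hypothetical helper, and `sample_tables` stands in for the scraped list. It counts individual rows whose name starts with `obj.`, which is stricter than `match` (which keeps or drops whole tables), so its numbers may differ from those printed above.

```python
import pandas as pd


def count_functionality_local(tables, obj):
    """Count rows whose first column starts with '<obj>.' (hypothetical helper)."""
    prefix = obj + "."
    count = 0
    for table in tables:
        # Column 0 holds the function/method/attribute name in these tables
        first_col = table[0].astype(str)
        count += int(first_col.str.startswith(prefix).sum())
    return count


# Small stand-ins for the list returned by a single read_html call
sample_tables = [
    pd.DataFrame({0: ["Series.index"], 1: ["The index (axis labels) of the Series."]}),
    pd.DataFrame({0: ["DataFrame.index", "DataFrame.columns"], 1: ["row labels", "column labels"]}),
    pd.DataFrame({0: ["Series.values", "Series.dtype"], 1: ["values", "dtype"]}),
]

for obj in ["Series", "DataFrame", "MultiIndex"]:
    print(obj, count_functionality_local(sample_tables, obj))
```

Returning the count instead of printing it also makes the result easy to reuse in later cells.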


## Problem 2
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to the [RealClearPolitics 2016 Clinton vs. Trump poll page][1]
1. Use pandas `read_html` to read in the full table at the **bottom** of the page and display it here in the notebook
1. Use the `header` parameter to find the correct header instead of the default numbers
1. Inspect the info to make sure the Clinton and Trump data types are float64
1. Add a column that calculates the difference between Clinton and Trump

[1]: http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html

In [14]:
url = 'http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html'
rcp_tables = pd.read_html(url, header=0)

In [15]:
len(rcp_tables)

3

The bottom table is the last one.

In [16]:
clinton_trump = rcp_tables[2]
clinton_trump.head()

Unnamed: 0,Poll,Date,Sample,MoE,Clinton (D),Trump (R),Spread
0,Final Results,--,--,--,48.2,46.1,Clinton +2.1
1,RCP Average,11/1 - 11/7,--,--,46.8,43.6,Clinton +3.2
2,BloombergBloomberg,11/4 - 11/6,799 LV,3.5,46.0,43.0,Clinton +3
3,IBD/TIPP TrackingIBD/TIPP Tracking,11/4 - 11/7,1107 LV,3.1,43.0,42.0,Clinton +1
4,Economist/YouGovEconomist,11/4 - 11/7,3669 LV,--,49.0,45.0,Clinton +4


In [17]:
clinton_trump.dtypes

Poll            object
Date            object
Sample          object
MoE             object
Clinton (D)    float64
Trump (R)      float64
Spread          object
dtype: object
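
Here the Clinton and Trump columns already came through as float64. A column that contains a placeholder such as `--`, as `MoE` does, is read as object instead. If a numeric column is needed in that situation, one option is `pd.to_numeric` with `errors="coerce"`, sketched below on a small hand-built frame (`polls` is a stand-in using a few values from the table above, stored as strings to mimic the problem case).

```python
import pandas as pd

# Stand-in frame: numeric values stored as text, with "--" as a placeholder
polls = pd.DataFrame({
    "MoE": ["--", "3.5", "3.1"],
    "Clinton (D)": ["48.2", "46.0", "43.0"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
polls["MoE"] = pd.to_numeric(polls["MoE"], errors="coerce")
polls["Clinton (D)"] = pd.to_numeric(polls["Clinton (D)"], errors="coerce")

print(polls.dtypes)
print(polls["MoE"].isna().sum())
```

Coercion silently discards unparseable values, so it is worth checking how many NaN values appear afterwards.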

In [18]:
clinton_trump['diff'] = clinton_trump['Clinton (D)'] - clinton_trump['Trump (R)']

In [19]:
clinton_trump.head(20)

Unnamed: 0,Poll,Date,Sample,MoE,Clinton (D),Trump (R),Spread,diff
0,Final Results,--,--,--,48.2,46.1,Clinton +2.1,2.1
1,RCP Average,11/1 - 11/7,--,--,46.8,43.6,Clinton +3.2,3.2
2,BloombergBloomberg,11/4 - 11/6,799 LV,3.5,46.0,43.0,Clinton +3,3.0
3,IBD/TIPP TrackingIBD/TIPP Tracking,11/4 - 11/7,1107 LV,3.1,43.0,42.0,Clinton +1,1.0
4,Economist/YouGovEconomist,11/4 - 11/7,3669 LV,--,49.0,45.0,Clinton +4,4.0
5,LA Times/USC TrackingLA Times,11/1 - 11/7,2935 LV,4.5,44.0,47.0,Trump +3,-3.0
6,ABC/Wash Post TrackingABC/WP Tracking,11/3 - 11/6,2220 LV,2.5,49.0,46.0,Clinton +3,3.0
7,FOX NewsFOX News,11/3 - 11/6,1295 LV,2.5,48.0,44.0,Clinton +4,4.0
8,MonmouthMonmouth,11/3 - 11/6,748 LV,3.6,50.0,44.0,Clinton +6,6.0
9,NBC News/Wall St. JrnlNBC/WSJ,11/3 - 11/5,1282 LV,2.7,48.0,43.0,Clinton +5,5.0
