## Getting Started with Statistics and Python

This notebook walks through the basics of getting your environment set up to play with statistics. In later notebooks I will just include this boilerplate without introduction. **If you don't mind glossing over boilerplate code, move on to the next notebook.**

### Environment

This notebook assumes you are running either locally or in Google's Colaboratory environment. Since some commands are environment-dependent, the method below will try to figure out where you are. YMMV on another platform, or as this notebook ages.

#### Locally

Clone this repository, navigate to the root of the repository, and run `jupyter notebook`. For more info go to: http://jupyter.org/install

#### Google Colaboratory

Go to: https://colab.research.google.com/

Once logged in, select GitHub and enter: https://github.com/cscollett/PlayingWithStats

In [None]:
# Determine whether we are running in a Google Colab environment
def at_google_colab():
    try:
        config = get_ipython().config
        # Colab kernels report their own kernel class
        return config['IPKernelApp']['kernel_class'] == 'google.colab._kernel.Kernel'
    except NameError:
        # get_ipython is undefined outside IPython, so this cannot be Colab
        return False

# where are we?
location = None
if at_google_colab():
    location = 'at Google'
else:
    location = 'locally'

# print prediction
print('I think you are running {}!'.format(location))
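
Because the kernel class name may change as this notebook ages, a second check can be handy. The following is only a sketch under an assumption: that Colab runtimes load the `google.colab` package at startup, so its presence in `sys.modules` hints at Colab. The helper name `probably_at_google_colab` is hypothetical and not part of this repository.

```python
import sys

def probably_at_google_colab():
    """Hypothetical helper: guess Colab by looking for its package."""
    # Assumption: the google.colab package is already imported on Colab runtimes
    return 'google.colab' in sys.modules

print('Colab detected: {}'.format(probably_at_google_colab()))
```

This check does not need `get_ipython()`, so it also runs in a plain Python interpreter, where it simply reports `False`.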

### Importing Libraries

We will be using several packages to help with the math and plotting.

*NumPy* - "NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." https://en.wikipedia.org/wiki/NumPy

*Pandas* - "In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series." https://en.wikipedia.org/wiki/Pandas_(software)

In [None]:
import numpy as np
import pandas as pd

*Matplotlib* - "Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy." https://en.wikipedia.org/wiki/Matplotlib

In [None]:
import matplotlib
import matplotlib.pyplot as plt
# call the function; the bare name at_google_colab is always truthy
if at_google_colab():
    %matplotlib inline
else:
    %matplotlib notebook

### 1. Pandas DataFrame Tutorial
Most mathy work in Python is done with NumPy arrays or Pandas DataFrames. NumPy arrays are better suited to mathematical operations than Python dictionaries. Pandas DataFrames are likewise better suited to mathematical operations than dictionaries, but they are also geared towards data management, much like relational database tables.

https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7
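
To make the contrast concrete, here is a minimal sketch that converts a few benchmark times from seconds to milliseconds, once with a dictionary and once with a NumPy array. The variable names (`timings`, `secs_array`, and so on) are made up for illustration.

```python
import numpy as np

# Benchmark times in seconds, held in a plain Python dictionary
timings = {'C': 2.54, 'Rust': 4.31, 'Java': 8.29}

# With a dictionary, arithmetic has to be spelled out element by element
timings_ms = {lang: secs * 1000 for lang, secs in timings.items()}

# With a NumPy array, the same arithmetic applies to every element at once
secs_array = np.array(list(timings.values()))
ms_array = secs_array * 1000

print(timings_ms)
print(ms_array)
print("Mean seconds: {}".format(secs_array.mean()))
```

The array version also gets methods such as `mean()` for free, which is the main reason numeric work tends to move out of dictionaries.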

In [None]:
d = {'language': ['C', 'Python 3', 'Rust', 'Javascript', 'Java', 'Haskell', 'Perl'],
     'b-tree_bmark_secs': [2.54, 93.40, 4.31, 23.78, 8.29, 25.13, 183.13]}
print("Python Dictionary")
print(type(d))
print(d)
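# np.array(d) would only wrap the whole dict in a 0-d object array, so build
# the array from the dict's values (mixed types are coerced to strings)
np_array = np.array(list(d.values()))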
print("\n")

print("NumPy Array")
print(type(np_array))
print(np_array)
print("\n")

print("Pandas Dataframe")
pd_dataframe = pd.DataFrame(data=d)
print(type(pd_dataframe))
print(pd_dataframe)

Notice that the dataframe has rows with index positions and columns with headers. To get a better sense of the size and scale of your dataframe, you can read its `shape` and `size` attributes:

In [None]:
print("Shape of data (rows, columns): {}".format(pd_dataframe.shape))
print("Size (count of all values): {}".format(pd_dataframe.size))

Print all the columns of the dataframe.

In [None]:
print(pd_dataframe.columns)

Print the data types. 

In [None]:
print("Data Types: {}".format(pd_dataframe.dtypes))

When your data is very big, you may only want to print out the first few rows.

In [None]:
# head() returns 5 rows by default, but you can specify any number
print(pd_dataframe.head(3))

_**Q**: Guess the command to print the last three rows?_

In [None]:
#   language  b-tree_bmark_secs
# 4     Java               8.29
# 5  Haskell              25.13
# 6     Perl             183.13

If you want only the data, you can print a simple NumPy array, with no column headers or row index.

In [None]:
print(type(pd_dataframe.values))
print(pd_dataframe.values)

Or maybe you only want a single column, in which case you can grab it as a Pandas Series.

In [None]:
print(type(pd_dataframe['b-tree_bmark_secs']))
print(pd_dataframe['b-tree_bmark_secs'])
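
A Series still carries the row index. If a plain Python list is what you are after, `tolist()` is one way to get it. This is a small sketch on a cut-down copy of the benchmark data; the names `df`, `secs_series`, and `secs_list` are illustrative only.

```python
import pandas as pd

# Small illustrative dataframe (a subset of the benchmark data)
df = pd.DataFrame({'language': ['C', 'Rust', 'Java'],
                   'b-tree_bmark_secs': [2.54, 4.31, 8.29]})

# Selecting one column gives a Pandas Series, which keeps the row index
secs_series = df['b-tree_bmark_secs']

# tolist() drops the index and returns a plain Python list
secs_list = secs_series.tolist()
print(type(secs_list))
print(secs_list)
```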

It may be handy to know that you can also sort a dataframe by one or more columns.

In [None]:
# sort_values returns a new, sorted dataframe; pd_dataframe keeps its original order
sorted_df = pd_dataframe.sort_values(by=['b-tree_bmark_secs'])
print(sorted_df)
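
Sorting keeps each row's original index label, which can look odd when printed. The sketch below, on a cut-down copy of the data with illustrative names, sorts slowest-first with `ascending=False` and then renumbers the rows with `reset_index(drop=True)`.

```python
import pandas as pd

# Small illustrative dataframe (a subset of the benchmark data)
df = pd.DataFrame({'language': ['C', 'Python 3', 'Rust'],
                   'b-tree_bmark_secs': [2.54, 93.40, 4.31]})

# ascending=False puts the largest values first
slowest_first = df.sort_values(by='b-tree_bmark_secs', ascending=False)
print(slowest_first)

# reset_index(drop=True) discards the old labels and renumbers from 0
renumbered = slowest_first.reset_index(drop=True)
print(renumbered)
```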

Finally, you may want to search for a subset. This is a little unintuitive if you are used to SQL. Let's say we want all the languages with a benchmark time under 10 seconds. The first step is to evaluate a condition on every row, which returns a Pandas Series of True/False values.

In [None]:
select_criteria = pd_dataframe['b-tree_bmark_secs'] < 10
print(type(select_criteria))
print(select_criteria)

Then you pass that series into the original dataframe. It will return only the rows where the series is True.

In [None]:
print(pd_dataframe[select_criteria])

As shorthand, you can combine both steps into one statement.

In [None]:
print(pd_dataframe[pd_dataframe['b-tree_bmark_secs'] < 10])
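
The same pattern extends to more than one condition, roughly like `AND` and `OR` in SQL. This sketch uses a cut-down copy of the data and illustrative names (`df`, `secs`, `mid_range`, `extremes`). Each condition needs its own parentheses because `&` and `|` bind more tightly than the comparison operators.

```python
import pandas as pd

# Small illustrative dataframe (a subset of the benchmark data)
df = pd.DataFrame({'language': ['C', 'Python 3', 'Rust', 'Javascript', 'Java'],
                   'b-tree_bmark_secs': [2.54, 93.40, 4.31, 23.78, 8.29]})
secs = df['b-tree_bmark_secs']

# & is an element-wise AND: rows slower than 3s and faster than 25s
mid_range = df[(secs > 3) & (secs < 25)]
print(mid_range)

# | is an element-wise OR: rows faster than 3s or slower than 90s
extremes = df[(secs < 3) | (secs > 90)]
print(extremes)
```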

## Done