# Introduction to Python's Data Science Stack

By Shahbaz Chaudhary

Python was created in 1990 by Guido van Rossum as a simple, beautiful language with a "batteries included" library.


Zen of Python (excerpt):

* Beautiful is better than ugly
* Explicit is better than implicit
* Simple is better than complex
* Complex is better than complicated
* Readability counts

# Basic data types

In [None]:
# This is a comment
1 + 2 # Integers

In [None]:
1.0 + 3.0 # Floating points

In [None]:
"Hello world" # Strings

In [None]:
True and False # Boolean types, just 1 and 0

In [None]:
None # Null type

# Container data types

In [None]:
[1, 2, "buckle", "my", "shoe"] # lists (mixed types!)

In [None]:
(3, 4) # tuples are immutable, unlike lists

In [None]:
{"python":1990, "R":1993, "key":"value"} # dictionaries

In [None]:
{1, 2, 3, 3, 3, 4} # sets ignore multiple entries
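
The difference between mutable lists and immutable tuples is easiest to see by trying to change each one. The following is a minimal sketch; the variable names (`shopping`, `point`, and so on) are made up for illustration.

```python
# Lists can be changed in place; tuples cannot.
shopping = ["eggs", "milk"]
shopping.append("bread")        # fine: lists are mutable

point = (3, 4)
try:
    point[0] = 10               # tuples do not support item assignment
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)

# Sets keep only one copy of each value.
unique_values = {1, 2, 3, 3, 3, 4}
print(unique_values)

# Dictionaries map keys to values and can be updated in place.
first_release = {"python": 1990, "R": 1993}
first_release["python"] += 0    # values are read and written by key
print(first_release["R"])
```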

# Variables

In [None]:
#Notice, no data type or declaration
x = 1

In [None]:
#Assign functions!
add_two = lambda x: x + 2

# Control Flow

In [None]:
isOkToProceed = True

#Notice the indentation!
if(isOkToProceed):
  print("Go ahead!")
else:
  print("Stop!")

In [None]:
#Loop (the flag is flipped inside the body, otherwise this never ends)
while(isOkToProceed):
  print("keep going!")
  isOkToProceed = False

In [None]:
#range is used quite often
for x in range(10):
    print(x)

# Functions

In [None]:
#Remember from before, lambdas are anonymous funcs
add_three = lambda x : x + 3

In [None]:
#Indentation is important and no types
def add_two(n):
  return n + 2

In [None]:
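#Notice keyword arguments (debug defaults to True)
def delete_files(location, debug=True):
  if debug:
    return # dry run: do nothing
  delete_them_files(location) # placeholder helper, not defined in this notebook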

In [None]:
delete_files("c:\\") # doesn't do anything
delete_files("c:\\", debug=False) # will delete files

# List awesomeness

In [None]:
mylist = [1,2,3,4,5,6,7,8,9]

mylist[1] # you know this, but python is zero based, so this is the 2nd item
mylist[1:5] # Get the 2nd through 5th items (the end index is excluded)
mylist[-1] # Get the last item
mylist[:4] # Get the first 4 items

# Even more list (and dictionary) awesomeness

In [None]:
mylist = [1,2,3,4,5,6,7,8,9]

# List comprehensions!
mylist_doubled = [e*2 for e in mylist]

mylist_no_nine = [e for e in mylist if e != 9]

# Dictionary comprehensions!
mydictionary = {e:e*e for e in mylist}
mydictionary

# Brief, Python-centric history of the data science world

* Python created 1990
* R created 1993
* Python 2.0 2000
* Matplotlib 2003 (charting)
* Numpy 2006 (alternative to Matlab)
* ggplot2 2007
* Scikit-learn 2007 (machine learning library)
* Pandas 2008 (Python's implementation of R's dataframes)
* Python 3 2008
* Statsmodels 2010? (statistical models)
* Tensorflow 2015 (deep learning)

(src: wikipedia)

# Python data science stack

`import numpy as np`

**Numpy** brings matrix math to Python. It provides n-dimensional arrays and linear algebra routines. It is mostly written in C and is heavily optimized.

`import pandas as pd`

**Pandas** brings R-style dataframes to Python.

`from sklearn import svm`

**Scikit-Learn** is the main machine learning library in Python. 

`import matplotlib.pyplot as plt`

**Matplotlib** is an extremely comprehensive charting library. Many user friendly packages use this as their base.

`import statsmodels.api as sm`

**Statsmodels** brings statistical models to Python.

**Jupyter** is the main tool data scientists use to write code and explore data.

# This is Jupyter!

In [None]:
1+1

In [None]:
# Magic commands, such as this one to do simple performance testing
%time sum(range(10))

In [None]:
# Get help, right in the notebook!
sum??

# Numpy

### Create vectors and matrices

In [None]:
import numpy as np

print("\nconvert a python list to an array")
print(np.array([1,2,3]))

print("\nliteral, 2-dimensional array")
print(np.array([[1,2,3],[4,5,6]]))

print("\ninitialize to all zeros")
print(np.zeros((2,2))) 

print("\nIdentity matrix")
print(np.eye(2)) 

print("\nA 2x2x3 random array")
array3d = np.random.random((2,2,3))
print(array3d)


### Operate on those matrices

In [None]:
array2d = np.random.random((2,2))
print("\nMultiply the whole array by 10")
print(array2d * 10)

print("\nUnlike Python lists, numpy arrays are typed")
print(array2d.dtype)

print("\nThey have pre-defined dimensions")
print(array2d.ndim, array2d.shape)

print("\nConvert their types")
print((array2d * 10).astype(int))

print("\nStandard aggregate functions work for all matrices")
print(np.sum(array2d))

### Slice and dice them

In [None]:
array2dA = np.random.random((4,4))
array2dB = np.random.random((4,4))
print(array2dA)
print()
print(array2dB)

print()
print(array2dA > array2dB)

print()
print(array2dA[0:3, 0:3])
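
The comparison above produces a boolean array, and such an array can also be used as a mask to select elements. The sketch below shows that, plus one small use of the linear algebra routines in `np.linalg`. The seed, the matrix `M` and the vector `y` are arbitrary values chosen for illustration.

```python
import numpy as np

# A fixed seed keeps the example reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)
a = rng.random((4, 4))
b = rng.random((4, 4))

# A boolean array can be used as a mask to pick out elements.
mask = a > b
larger = a[mask]    # 1-D array holding the entries of a that exceed b
print(larger)

# Linear algebra: solve M @ x = y for x.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0])
x = np.linalg.solve(M, y)
print(x)

# Check the solution by multiplying back.
print(np.allclose(M @ x, y))
```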

# Pandas

In [None]:
import pandas as pd
%matplotlib inline

#iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris_df = pd.read_csv('iris.csv')

In [None]:
iris_df.head()

In [None]:
iris_df.describe()

In [None]:
_ = iris_df.hist()

In [None]:
_ = iris_df[iris_df.sepal_width > 3.5].hist()

In [None]:
iris_df.groupby("species").mean()

Pandas also supports join, merge, pivot, sample, imputation, finding duplicates, etc.
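
A few of those operations can be sketched on two tiny tables. The data below is made up for illustration and is not taken from `iris.csv`; the column name `flower_id` is an assumption introduced only to have a key to merge on.

```python
import numpy as np
import pandas as pd

# Small made-up tables, used only for illustration.
measurements = pd.DataFrame({
    "flower_id": [1, 2, 2, 3],
    "petal_length": [1.4, 4.7, 4.7, np.nan],
})
labels = pd.DataFrame({
    "flower_id": [1, 2, 3],
    "species": ["setosa", "versicolor", "virginica"],
})

# Find and drop duplicate rows.
n_duplicates = measurements.duplicated().sum()
deduped = measurements.drop_duplicates()

# Simple imputation: fill missing values with the column mean.
filled = deduped.fillna({"petal_length": deduped["petal_length"].mean()})

# Merge (a SQL-style join) on a shared key.
merged = filled.merge(labels, on="flower_id", how="left")

# Pivot-style summary: one row per species.
summary = merged.pivot_table(index="species", values="petal_length", aggfunc="mean")
print(summary)
```

Mean imputation is only one of several possible strategies; it is used here because it fits on one line.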

# Intermission
## Let's talk about Arrow/Feather

In [None]:
#!conda install -y pyarrow -c conda-forge
import pyarrow

In [None]:
#Load 7.2 gigs of stock market data, 78,513,535 lines!
%time big_df = pd.read_csv('big.csv')

In [None]:
#big_df.to_feather('big.feather')
%time big_df = pd.read_feather('big.feather')

In [None]:
big_df = None

# Arrow allows data to be read by pandas or R dataframes!

The Feather format, built on Apache Arrow, was created by Hadley Wickham (R super guru) and Wes McKinney (creator of Pandas). Watch for Arrow to make R and Python communities _much_ more collaborative!

Similar work going on in Julia, Spark and other data science communities.

# Scikit-Learn

Contains algorithms for:

Classification

* SVM
* Nearest Neighbors
* Random Forest
* ...

Regression

* SVR
* OLS (ridge/lasso/elastic/...)
* ...

Clustering

* K-Means
* Spectral clustering
* ...

Dimensionality reduction

* PCA
* Non-negative matrix factorization
* ...

Model selection

* Grid search
* Cross validation
* ...

Pre-processing

* One-hot encoding
* Train/Test split
* ...
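
As a rough sketch of the model selection tools listed above, the following combines a train/test split with a cross-validated grid search over an SVM. It loads the iris data from `sklearn.datasets` instead of the `iris.csv` file used elsewhere in this notebook, and the parameter grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Bundled copy of the iris data (an assumption; the notebook reads iris.csv).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Try every combination in the grid, scoring each with 5-fold cross validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# Accuracy of the best model on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```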

In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from yellowbrick.classifier import ConfusionMatrix

In [None]:
iris_array = iris_df.values
X = iris_array[:, 0:4]
Y = iris_array[:, 4]

validation_size = 0.2
seed = 1

X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

#Source https://www.kaggle.com/cornhedgehog/iris-example

In [None]:
classifier = SVC(gamma=0.001, C=100.) # SVC was imported directly above
classifier.fit(X_train, Y_train)
predictions = classifier.predict(X_validation)

print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

In [None]:
classifier = KNeighborsClassifier()
classifier.fit(X_train, Y_train)
predictions = classifier.predict(X_validation)

print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

Notice that the API is extremely consistent
![](sklearn_diff.png)
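
One way to see that consistency is to run several classifiers through the same loop, since each exposes the same `fit` and `predict` methods. This is a minimal sketch that loads iris from `sklearn.datasets` (an assumption; the cells above use `iris.csv`), and the choice of classifiers and their settings is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Different algorithms, identical interface.
classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_validation)
    scores[name] = accuracy_score(y_validation, predictions)
    print(name, scores[name])
```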

# Statsmodels

Statsmodels is closer to R's stats functions. It also contains a time-series component, but no equivalent of R's _auto.arima_.

In [None]:
from statsmodels.formula.api import ols

model = ols('sepal_width ~ species + petal_length', iris_df).fit()

In [None]:
model.summary()

# Statsmodels - Time Series
(source: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)

In [None]:
#!pip install pmdarima

In [None]:
from statsmodels.tsa.arima.model import ARIMA # the older statsmodels.tsa.arima_model module was removed in later releases
import pmdarima as pm

df = pd.read_csv('wwwusage.csv', names=['value'], header=0)

model = pm.auto_arima(df.value, start_p=1, start_q=1,
                      test='adf',       # use adftest to find optimal 'd'
                      max_p=3, max_q=3, # maximum p and q
                      m=1,              # frequency of series
                      d=None,           # let model determine 'd'
                      seasonal=False,   # No Seasonality
                      start_P=0, 
                      D=0, 
                      trace=True,
                      error_action='ignore',  
                      suppress_warnings=True, 
                      stepwise=True)

print(model.summary())

In [None]:
model.plot_diagnostics(figsize=(7,5))
plt.show()


... don't forget TensorFlow and PyTorch, the whole web app ecosystem, and MANY other libraries

### The End
