# Introduction to Python in Data Science

> Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured...

> Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.

## History of Data Science

* 1650-1700: origins of probability, term "statistics" coined. Pascal, Huygens.
* 1700-1750: Jacob Bernoulli’s law of large numbers.
* 1750-1800: conditional probability with applications to inverse probability or Bayesian inference.
* 1800-1850: method of least squares (Legendre), central limit theorem (Laplace), Poisson, Gauss
* 1850-1900: correlation, chi square, population, histogram, standard deviation, Pearson
* 1900-1950: rank correlation, factor analysis (Spearman), time series, Markov chains, Student's (Gosset) t-distribution, hypothesis testing, discriminant analysis (Fisher), PCA (Hotelling), Kolmogorov-Smirnov test, Wilcoxon test
* 1957: Fortran
* 1958: Perceptron (Rosenblatt)
* 1966: AI winter begins
* 1975: Backpropagation, Tufte
* 1984: Matlab
* 1985: Boltzmann machine
* 1987: Excel
* 1991: Python
* 1993: R
* 1995: SVM
* 1997: RNN
* 1998: CNN (LeCun)
* 1999: breakthrough in GPU
* 2000: Big Data era
* 2006: Hadoop
* 2008: Pandas
* 2011: IPython (Jupyter)
* 2012: Deep learning (Hinton, Ng) (AI winter ends)
* 2015: TensorFlow

## What roles are in Data Science?

Please take a look at this infographic:

[The Data Science Industry](https://www.datacamp.com/community/tutorials/data-science-industry-infographic)

## Python Data Science toolset

- **NumPy**: stands for Numerical Python. Operations on n-dimensional arrays and matrices in Python, vectorization, performance. Low-level.
- **SciPy**: built upon NumPy. Linear algebra, optimization, integration, and statistics.
- **Pandas**: built upon NumPy. High-level data (usually tabular) wrangling.

### Visualization
- **Matplotlib**: SciPy stack core package. Pretty low-level.
- **Seaborn**: based on Matplotlib. Mostly focused on the visualization of statistical models.
- **Bokeh**: focused on interactivity, dashboards, browser rendering (SVG, but makes no use of d3.js).
- **Plotly**: built on top of d3.js, dashboards, convert from matplotlib.

### Machine Learning
- **SciKit-Learn**: built on top of SciPy. A de facto industry standard for machine learning with Python.

### Deep Learning
- **Theano**: tightly integrates with NumPy, optimizes the use of GPU and CPU.
- **TensorFlow**: open-sourced by Google. Built upon the concept of computational graphs. Uses its own tensor type rather than NumPy arrays. Has GPU optimizations.
- **Keras**: uses Theano or TensorFlow as its backends. Minimalistic approach in the design, quick prototyping.

## Introduction to NumPy

NumPy provides a highly efficient array implementation. Unlike a Python list, all items of a NumPy array must have the same type. Multi-dimensional arrays are supported as well.

In [None]:
import numpy as np

x = np.array([1, 2, 3, 4])
x

In [None]:
x.dtype

In [None]:
x.shape

If the element types do not match, NumPy upcasts them to a common type:

In [None]:
np.array([1, 2, 3.14])

Force type:

In [None]:
np.array([1, 2, 3], dtype='float32')

Ways to specify the array type (casting floats to an integer type drops the fractional part):

[Built-in scalar types](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.scalars.html#arrays-scalars-built-in)

In [None]:
np.array([1.1, 2.2, 3.3], dtype='int64')

In [None]:
np.array([1.1, 2.2, 3.3], dtype=int)

In [None]:
np.array([1.1, 2.2, 3.3], dtype=np.int64)

Multidimensional arrays:

In [None]:
x = np.array([[1, 2, 3], [3, 4, 5], [6, 7, 8]])
x

In [None]:
x.shape

Some other ways of array creation:

In [None]:
np.zeros(8)

In [None]:
np.zeros((3, 3), dtype=int)

In [None]:
np.ones((2, 2), dtype=int)

In [None]:
np.full(5, 3.14)

Identity matrix:

In [None]:
np.eye(3, dtype=int)

Evenly spaced numbers:

In [None]:
np.linspace(0, 1, 4)

Matrix of random numbers:

In [None]:
np.random.randint(0, 10, (3, 3))

A range of consecutive integers:

In [None]:
x = np.arange(1, 10)
x

Reshape arrays:

In [None]:
x2 = x.reshape(3, 3)
x2

In [None]:
x2.ndim

### Array access and indexing

In [None]:
x[0]

In [None]:
x[-1]

Slicing works too:

In [None]:
x[::2]

In [None]:
indices = [0, 3, 6]
x[indices]

Multi-dimensional arrays:

In [None]:
x2 = np.arange(12).reshape(3, 4)
x2

In [None]:
x2[1, 1]

In [None]:
x2[1]

In [None]:
x2[1, :]

In [None]:
x2[:, 1]

Fun with slicing:

In [None]:
x2[:2, :2]

In [None]:
x2[1:, 1:]

In [None]:
x2[::2, ::2]

In [None]:
x2[::-1, ::-1]

In [None]:
row_indices = [0, 2]
x2[row_indices, :]

In [None]:
col_indices = [0, 3]
x2[row_indices, col_indices]

In [None]:
x2.T

Replacement using slicing also works:

In [None]:
x2[0, :] = [100] * 4
x2

In [None]:
x2[::2, ::2] = 888
x2
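One thing worth keeping in mind when assigning through slices: basic slicing returns a view on the original data, not a copy. The sketch below uses made-up variable names (`a`, `view`, `independent`) to illustrate the difference.

```python
import numpy as np

a = np.arange(6)

view = a[:3]      # basic slicing returns a view on the same memory
view[0] = 99      # ...so this assignment also changes `a`

independent = a[:3].copy()   # an explicit copy owns its data
independent[1] = -1          # `a` is not affected by this

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, independent))
```

Calling `.copy()` is the usual way to get an array that can be modified safely.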

### Array arithmetic

Operations are vectorized. Avoid loops!

In [None]:
np.array([1, 2]) + np.array([3, 4])

In [None]:
np.array([1.0, 2.0, 3.0]) * np.array([2.0, 2.0, 2.0])

Array broadcasting: shapes are compared dimension by dimension, starting from the trailing one. Two dimensions are compatible when:
- they are equal, or
- one of them is 1

In [None]:
np.array([7]).shape

In [None]:
np.array([1.0, 2.0, 3.0]) + 7

In [None]:
np.zeros((3, 3)) + np.array([1, 1, 1])
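As a further illustration of the broadcasting rule, the following sketch combines a column of shape (3, 1) with a row of shape (1, 4), and then tries two shapes that are not compatible. The names `col`, `row`, `table` and `broadcast_failed` are only for this example.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)

# Each size-1 dimension is stretched to match the other operand,
# so the result has shape (3, 4)
table = col + row
print(table.shape)

# Shapes (3,) and (4,) are incompatible: 3 != 4 and neither is 1
try:
    np.ones(3) + np.ones(4)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```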

#### Universal functions (ufuncs)

In [None]:
x = [1, 2, 3, 10]

np.exp2(x)

In [None]:
np.power(3, x)

In [None]:
np.log10(x)

#### Aggregates

In [None]:
x = np.arange(1, 10)
x

In [None]:
np.add.reduce(x)

In [None]:
np.multiply.reduce(x)

Intermediate results of the computation:

In [None]:
np.add.accumulate(x)
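The `reduce` and `accumulate` calls have more familiar counterparts. A minimal sketch, assuming the same `x = np.arange(1, 10)` as above, checks that they agree with `np.sum`, `np.prod` and `np.cumsum`:

```python
import numpy as np

x = np.arange(1, 10)

total = np.add.reduce(x)           # same result as np.sum(x)
product = np.multiply.reduce(x)    # same result as np.prod(x)
running = np.add.accumulate(x)     # same result as np.cumsum(x)

print(total == np.sum(x), product == np.prod(x))
print(np.array_equal(running, np.cumsum(x)))
```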

More about vectorization:

In [None]:
big_array = np.random.rand(10000)
%timeit sum(big_array)
%timeit np.sum(big_array)

In [None]:
%timeit min(big_array)
%timeit np.min(big_array)
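To make the "avoid loops" advice concrete, here is a small sketch that computes the same result twice, once with an explicit Python loop and once with a vectorized expression. The array `values` and the formula are arbitrary choices for illustration.

```python
import numpy as np

values = np.arange(1, 6, dtype=float)

# Explicit Python loop: one interpreted iteration per element
squares_loop = np.empty_like(values)
for i, v in enumerate(values):
    squares_loop[i] = v * v + 1

# Vectorized equivalent: the loop runs inside NumPy's compiled code
squares_vec = values ** 2 + 1

print(np.array_equal(squares_loop, squares_vec))
```

On large arrays the vectorized form is typically much faster, which is what the `%timeit` cells above measure.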

### Compound types

In [None]:
df = np.empty(2, dtype={'names': ('name', 'age'), 'formats': ('U10', int)})
df

In [None]:
df['name'] = ['Petya', 'Vasya']
df['age'] = [23, 32]
df

In [None]:
df[0]

In [None]:
df['name']

In [None]:
df[0]['name']
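Fields of a structured array can also be used in boolean masks. The sketch below builds a similar two-record array directly from tuples (the name `people` and the age threshold are assumptions for this example) and selects the names of the records that match a condition.

```python
import numpy as np

people = np.array([('Petya', 23), ('Vasya', 32)],
                  dtype=[('name', 'U10'), ('age', 'i8')])

# Boolean mask on one field, then pick another field from the matches
older = people[people['age'] > 30]['name']
print(older)
```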

## Introduction to Pandas

A first glance at a DataFrame:

In [None]:
import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

In [None]:
type(iris)

In [None]:
iris.shape

In [None]:
iris.columns

Explore data frame:

In [None]:
iris.tail(3)

In [None]:
iris.describe()

### Series: basic Pandas object

In [None]:
s = pd.Series(['alpha', 'beta', 'gamma'])
s

In [None]:
s.values

In [None]:
s.index

A Series is indexed: by position by default, or by custom labels.

In [None]:
s[2]

In [None]:
s.index = ['a', 'b', 'g']
s['g']

A Series can be created from a dict. Since Python 3.7 dicts preserve insertion order, and recent pandas versions keep that order (older versions sorted the keys):

In [None]:
d = {'python': 10, 'java': 5, 'php': -1, 'ruby': 8, 'c': 3, 'assembler': 0}
sd = pd.Series(d)
sd

Label-based slicing works on a sorted index, and both endpoints are included:

In [None]:
sd.sort_index()['java':'python']

### DataFrame

Pandas DataFrame can be created in many ways:
- [from lists and dictionaries](http://pbpython.com/pandas-list-dict.html)
- from Series
- from other DataFrames

From dicts:

In [None]:
sales = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': 95 }]
pd.DataFrame(sales)

In [None]:
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
         'Jan': [150, 200, 50],
         'Feb': [200, 210, 90],
         'Mar': [140, 215, 95]}
pd.DataFrame.from_dict(sales)

From lists:

In [None]:
sales = [('Jones LLC', 150, 200, 140),
         ('Alpha Co', 200, 210, 215),
         ('Blue Inc', 50, 90, 95)]

labels = ['account', 'Jan', 'Feb', 'Mar']

pd.DataFrame.from_records(sales, columns=labels)

In [None]:
sales = [('account', ['Jones LLC', 'Alpha Co', 'Blue Inc']),
         ('Jan', [150, 200, 50]),
         ('Feb', [200, 210, 90]),
         ('Mar', [140, 215, 95]),]

# DataFrame.from_items was removed in pandas 1.0; build a dict from the pairs instead
pd.DataFrame(dict(sales))

From Series:

In [None]:
periods = ['September', 'October', 'November', 'December']

income = pd.Series([300, 1000, 2000, 5000], index=periods)
profit = pd.Series([0, 10, 20, 500], index=periods)

firm = pd.DataFrame(dict(income=income, profit=profit))
firm

In [None]:
firm['income']

In [None]:
type(firm['income'])

In [None]:
firm.income  # works, but prefer firm['income']: attribute access fails for names that clash with DataFrame methods

In [None]:
firm[['income']]

In [None]:
firm['income']['December']

In [None]:
firm['income']['September':'November']

A source of confusion: while indexing refers to columns, slicing refers to rows.

In [None]:
firm['September':'November']

A DataFrame behaves like a dictionary of columns:

In [None]:
firm.keys()

In [None]:
'income' in firm

Coerce to numpy array:

In [None]:
firm.values

#### .loc and .iloc

A source of confusion: with an integer index, `s[3]` looks up the label 3, while the slice `s[1:3]` selects by position:

In [None]:
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
s

In [None]:
s[3]

In [None]:
s[1:3]

.iloc references the implicit, position-based (Python-style) index:

In [None]:
s.iloc[1]

.loc references the explicit index (the labels):

In [None]:
s.loc[1]
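The two accessors also differ in how they treat the end of a slice. A short sketch on the same Series (the result names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1:3]      # labels 1 through 3, end label included
by_position = s.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label.tolist())
print(by_position.tolist())
```

Using `.loc` or `.iloc` explicitly removes the ambiguity of plain `[]` indexing.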

Now with DataFrame:

In [None]:
firm.iloc[0, 0]

In [None]:
firm.iloc[:1, :1]

In [None]:
firm.loc['October', 'income']

### Boolean masks: the most commonly used way to select rows

In [None]:
firm['income'] > 1000

In [None]:
firm[firm['income'] > 1000]

In [None]:
firm.query("income > 1000")
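Several conditions can be combined in one mask. The sketch below rebuilds a DataFrame with the same values as `firm` so that it runs on its own; the thresholds are arbitrary. Each condition needs its own parentheses, because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

periods = ['September', 'October', 'November', 'December']
firm = pd.DataFrame({'income': [300, 1000, 2000, 5000],
                     'profit': [0, 10, 20, 500]}, index=periods)

# Use & (and), | (or), ~ (not) to combine boolean Series element-wise
mask = (firm['income'] >= 1000) & (firm['profit'] < 100)
selected = firm[mask]

# The same selection written as a query string
same = firm.query('income >= 1000 and profit < 100')

print(selected)
```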

### Merge DataFrames

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

In [None]:
display(df1, df2)

In [None]:
pd.merge(df1, df2)
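By default `pd.merge` joins on the columns the two frames share and keeps only keys present in both (an inner join). A sketch with slightly different, made-up data shows how `on` and `how` make this explicit and what changes with an outer join:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Sue'],
                    'hire_date': [2004, 2008, 2014]})

# Inner join: only employees that appear in both frames
inner = pd.merge(df1, df2, on='employee', how='inner')

# Outer join: all employees; missing values are filled with NaN
outer = pd.merge(df1, df2, on='employee', how='outer')

print(inner)
print(outer)
```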

### Aggregation and grouping

In [None]:
df = pd.DataFrame({'A': np.random.randint(0, 10, 5),
                   'B': np.random.randint(0, 10, 5)})
df

In [None]:
df.mean()

In [None]:
df.mean(axis='columns')

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

In [None]:
df.groupby('key')

In [None]:
df.groupby('key').sum()
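A group-by is not limited to a single aggregate. As a sketch on the same data, `agg` accepts a list of function names and returns one column per aggregate (the name `summary` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# One row per key, one column per aggregate
summary = df.groupby('key')['data'].agg(['sum', 'mean', 'count'])
print(summary)
```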

### Pivot tables

In [None]:
titanic = sns.load_dataset('titanic')

titanic.head()

In [None]:
titanic.groupby('sex')[['survived']].mean()

In [None]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

In [None]:
titanic.pivot_table('survived', index='sex', columns='class')
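The equivalence between the group-by form and `pivot_table` can be checked on a tiny hand-made table. The `passengers` data below is invented for this sketch and is not taken from the Titanic dataset.

```python
import pandas as pd

passengers = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass': [1, 3, 1, 3, 3, 1],
    'survived': [1, 0, 1, 0, 0, 1],
})

# Group by two keys, average, then move the second key into columns
by_group = passengers.groupby(['sex', 'pclass'])['survived'].mean().unstack()

# The same table in one call; 'mean' is the default aggfunc
by_pivot = passengers.pivot_table(values='survived', index='sex',
                                  columns='pclass', aggfunc='mean')

print(by_pivot)
```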

## Steps in a typical Data Science project

* Raw data extraction
* Data wrangling
* Exploratory data analysis
* Knowledge extraction (machine learning, statistical models, other algorithms)
* Report
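The steps above can be sketched end to end on toy data. Everything in this example is an assumption made for illustration: an in-memory CSV stands in for a real data source, the column names are invented, and the knowledge-extraction step is reduced to a simple group mean.

```python
import io
import pandas as pd

# 1. Raw data extraction (an in-memory CSV stands in for a real source)
raw = io.StringIO("city,temp_c\nRiga,21\nRiga,\nTallinn,18\nTallinn,20\n")
data = pd.read_csv(raw)

# 2. Data wrangling: drop rows with missing measurements
clean = data.dropna(subset=['temp_c'])

# 3. Exploratory data analysis: summary statistics per city
summary = clean.groupby('city')['temp_c'].mean()

# 4. Report
print(summary.to_string())
```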

## Data Visualization and Exploratory data analysis

Diagram of the Causes of Mortality in the army in the East by **Florence Nightingale, 1855**

![](http://historyofinformation.com/images/3815a%20Large.jpg)

Carte figurative des pertes successives en hommes de l’Armée Française dans la campagne de Russie 1812-1813 by **Charles Minard, 1869**

![](http://mapdesign.icaci.org/wp-content/uploads/2014/08/MapCarte237_minard_large.png)

![](https://apandre.files.wordpress.com/2011/02/chartchooserincolor.jpg)

## Visualization with Matplotlib

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['font.size'] = 24
plt.rcParams['legend.fontsize'] = 'large'
plt.rcParams['figure.titlesize'] = 'medium'
plt.rcParams['figure.figsize'] = (16.0, 8.0)

### Barplot

In [None]:
titanic.groupby('pclass').survived.sum().plot(kind='bar')

In [None]:
titanic.groupby(['sex', 'pclass']).survived.sum().plot(kind='barh')

### Stacked barplot

In [None]:
death_counts = pd.crosstab([titanic.pclass, titanic.sex], titanic.survived.astype(bool))
death_counts.plot(kind='bar', stacked=True, color=['black','gold'], grid=False)

### "Standardized" barplot

In [None]:
death_counts.div(death_counts.sum(axis=1).astype(float), axis=0).plot(kind='barh', stacked=True, color=['black','gold'])

### Histograms

Default 10 bins:

In [None]:
titanic.fare.hist(grid=False)

In [None]:
titanic.fare.hist(bins=30)

### Boxplots

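The box spans the first to the third quartile, with a line at the median. By default the whiskers extend to the most extreme points within 1.5 × IQR of the box; points beyond them are drawn as outliers.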

In [None]:
titanic.boxplot(column='fare', by='pclass', grid=False)

In [None]:
bp = titanic.boxplot(column='age', by='pclass', grid=False)
for i in [1,2,3]:
    y = titanic.age[titanic.pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
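The numbers behind a boxplot can be computed by hand. Matplotlib's default places the whisker limits at 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a made-up sample; `values` and the fence names are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```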

### Scatter plot

In [None]:
# Map each species to a Matplotlib color code
iris['color'] = iris['species'].map({'setosa': 'r', 'versicolor': 'b', 'virginica': 'k'})

iris.plot(kind="scatter", x="sepal_length", y="sepal_width", c=iris['color'])

# Homework

* Create a new Jupyter notebook
* Copy the contents of the next cell into your notebook and run it (it may take several seconds to complete)
* This will load a dataset of cars scraped from SS.COM into a Pandas DataFrame
* Find interesting insights in this dataset, be creative!
* When you're done, send your notebook to **pimenoff@gmail.com**

In [None]:
import urllib.request, json

dataset = json.loads(urllib.request.urlopen('https://github.com/danaki/ss-scraper-sample/raw/master/data/cars-20170927.json').read())
cars = pd.DataFrame(dataset)
cars = cars.drop(columns=['_changes', '_id'])  # drop returns a new DataFrame, so reassign
cars.describe()

In [None]:
real_mileage = cars.query('mileage < 100000').copy()  # copy, so adding a column does not trigger SettingWithCopyWarning
real_mileage['age'] = 2018 - real_mileage['production_year']
real_mileage.plot(kind='scatter', x='mileage', y='age')