# (My) Scientific Python Toolbox








**Ariel Rossanigo**


### About me

* Ariel Rossanigo
* Artificial Intelligence Professor at UCSE-DAR
* Developer, Data Scientist



### Motivation

*People often ask me about what I do, and the conversation immediately turns to questions like:*

* Python? Is it useful for that?
* Isn't it slow?
* Why don't you use [put some well-known name here]?
* **What do you need to do that?**


### Goals

* Share why I use Python and demystify the **scientific** side
* Show the tools I use with some examples


### Why scientific?

* These are tools commonly related to scientific projects, but...
* **Everyone can take advantage of them**

#### What does a scientist do?

* Process and visualize data
* Propose hypotheses and models
* Make predictions with the model and validate the hypotheses
* Communicate results

#### What does a scientist need?

* An easy programming language...
* With tools that simplify their job...
* Fast to develop and execute...
* Able to be deployed in production...

### Why Python?

* It's the language I've known for many years... and nothing has made me change that choice...
* Lets you do almost anything...
* Open source and free - the licence can be summarized as *Do whatever you want*
* **Lots of scientific tools**
* **Lets you prototype models and deploy them in production environments**
* It's **slow**, but every heavy-duty tool is implemented in something faster...


## (My) Scientific Python Toolbox

* IPython + **Jupyter notebook** + **RISE**  
* Numpy +  SciPy                  
* **Pandas**     
* bokeh          
* **Matplotlib** 
* seaborn        
* **sklearn**    
* keras          
* tensorflow     

*All of these in addition to the usual: virtualenv, pip, git, etc...*

#### IPython 

* Python interactive interpreter on steroids
* kernel for Jupyter notebook

#### Jupyter notebook

* Project Jupyter was born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages.

* The notebook is an interactive interpreter with a web interface, composed of a sequence of cells, each of which can be of a different type, for example: code, videos, images, markdown, graphs... even LaTeX...

$$c = \sqrt{a^2 + b^2}$$


#### RISE (Reveal.js IPython Slideshow Extension)

* Notebook extension to make presentations (like this one)
* The presentation is executable

In [None]:
for x in range(2):
    print('Hello everyone!')

### Numpy

* Multidimensional arrays implemented in an efficient way
* It's the cornerstone of almost every scientific Python tool
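
As a minimal sketch of what "efficient" means here (the array contents are made up for illustration), the snippet below computes the same sum of squares with a plain Python loop and with a vectorized NumPy expression, then aggregates a small 2-D array along each axis:

```python
import numpy as np

# Illustrative data: one million integers (made up for this sketch)
values = np.arange(1_000_000, dtype=np.int64)

# Pure Python: the interpreter handles one element at a time
loop_total = 0
for v in values.tolist():
    loop_total += v * v

# NumPy: the loop runs in compiled code over the whole array
vector_total = int((values * values).sum())

print(loop_total == vector_total)

# Multidimensional arrays: reshape and aggregate along an axis
grid = np.arange(12).reshape(3, 4)
print(grid.sum(axis=0))   # column sums
print(grid.mean(axis=1))  # row means
```

Both totals are equal; the difference is in where the loop runs, which is why the vectorized form is usually much faster on large arrays.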

### SciPy library

* Provides many user-friendly and efficient numerical routines for different topics, e.g.:

 * signal: signal processing
 * stats: probability distributions and statistical functions.
 * sparse: 2-D sparse matrix package
 * ...
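
A small sketch of two of the subpackages listed above, using made-up values: `stats` for a probability distribution and `sparse` for a matrix that is mostly zeros.

```python
import numpy as np
from scipy import sparse, stats

# stats: a normal distribution with mean 0 and standard deviation 1
normal = stats.norm(loc=0, scale=1)
print(normal.cdf(0.0))   # probability of drawing a value below the mean

# sparse: store only the non-zero entries of a mostly empty matrix
dense = np.zeros((4, 4))
dense[0, 1] = 5.0
dense[2, 3] = 3.0
matrix = sparse.csr_matrix(dense)
print(matrix.nnz)        # number of stored entries
print(matrix.shape)
```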


### Pandas (Python Data Analysis Library)

* The *de facto* tool to work with data in Python
* Uses NumPy behind the scenes, but...
 * offers friendlier abstractions like Series and DataFrames
 * allows data manipulation in a SQL-like way
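
To make the two abstractions concrete, here is a minimal sketch with invented data (the fruit names and numbers are only for illustration). It builds a Series and a DataFrame, then runs a filter plus aggregation that reads much like a SQL query:

```python
import pandas as pd

# A Series is a labelled one-dimensional array
prices = pd.Series([10.0, 12.5, 7.0], index=['apple', 'pear', 'kiwi'])
print(prices['pear'])

# A DataFrame is a table whose columns are Series
sales = pd.DataFrame({
    'fruit': ['apple', 'pear', 'apple', 'kiwi'],
    'units': [3, 1, 2, 5],
})

# Roughly: SELECT fruit, SUM(units) FROM sales WHERE units > 1 GROUP BY fruit
totals = sales[sales.units > 1].groupby('fruit').units.sum()
print(totals)
```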

Let's see what we can do with the tools shown so far with an example...


### The example: Netflix prize (2009)

* 1 million USD for the winner
* 17,770 movies (listed in a text file named movie_titles.txt)
* 480,000 users
* **100,480,507 ratings** spread across 17,770 text files

#### Challenge goal

*Create a model to predict user ratings for movies not in the data*

#### Our goal

*Show the things we can do with the tools shown previously*

**Reading a CSV**

In [None]:
from IPython.display import display
import numpy as np
import pandas as pd
from utils import data_path # this is a function to simplify path management

movies = pd.read_csv(data_path('movie_titles.txt'), 
                     names=['movie_id', 'year_of_release', 'title'], 
                     index_col='movie_id',
                     encoding='latin-1')

print("Number of movies: {:,}".format(len(movies)))
movies.head()

**Some operations with Pandas**

In [None]:
# aggregations 
first_year = movies.year_of_release.min()
print("First year a release was made: {}".format(first_year))

# filtering the movies that were released that year
display(movies[movies.year_of_release == first_year])

# movies from this millennium that have Monty in their name
display(movies[(movies.year_of_release >= 2000) &
               (movies.title.str.contains('Monty'))])

In [None]:
# What kind of magic does that filter do?...
# display(movies[(movies.year_of_release >= 2000) & (movies.title.str.contains('Monty'))])

display((movies.year_of_release >= 2000).head())
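
The "magic" is boolean masking: each comparison yields a Series of `True`/`False` values, and indexing a DataFrame with such a Series keeps only the rows marked `True`. A minimal sketch on a toy table (the titles and years are made up, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the movies table (titles and years are made up)
toy = pd.DataFrame({
    'year_of_release': [1975, 2001, 2004],
    'title': ['Monty A', 'Monty B', 'Other C'],
})

recent = toy.year_of_release >= 2000          # boolean Series
has_monty = toy.title.str.contains('Monty')   # boolean Series
both = recent & has_monty                     # element-wise AND

print(both.tolist())
print(toy[both])   # keeps only the rows where the mask is True
```

When the conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `>=`.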

### Reading some votes from a pickle

In [None]:
all_ratings = pd.read_pickle(data_path('ratings.pkl'))
print('number of ratings: {:,}'.format(len(all_ratings)))
all_ratings.head()

#### 5 million records... quickly enough....

####  What about the RAM consumption?

In [None]:
all_ratings.info()
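
One common way to reduce memory use is to pick smaller integer types. The sketch below uses a toy table whose column names are assumptions, not the real layout of `ratings.pkl`; it compares the bytes used by the data before and after downcasting:

```python
import numpy as np
import pandas as pd

# Toy ratings table; the column names are assumptions for this sketch
toy_ratings = pd.DataFrame({
    'movie_id': np.arange(1000, dtype=np.int64),
    'stars': np.tile(np.array([1, 2, 3, 4, 5], dtype=np.int64), 200),
})
before = toy_ratings.memory_usage(index=False).sum()

# Stars fit in 8 bits and these ids fit in 16 bits
smaller = toy_ratings.astype({'movie_id': np.int16, 'stars': np.int8})
after = smaller.memory_usage(index=False).sum()

print(before, after)
```

With 17,770 movies, the movie ids of this dataset would also fit in a 16-bit integer.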

### What happens if we want to show movie data along with votes?

In [None]:
# something like a SQL join
all_together = pd.merge(movies, all_ratings, left_index=True, 
                        right_on='movie_id')

print('Number of ratings: {:,}'.format(len(all_together)))
display(all_together.head())
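
To see what the join does to individual rows, here is a minimal sketch with toy tables. The names mirror the ones above, but the values are made up, and the `stars` column is assumed to hold the rating:

```python
import pandas as pd

# Toy tables; names mirror the ones above but the values are made up
toy_movies = pd.DataFrame({'title': ['A', 'B']},
                          index=pd.Index([1, 2], name='movie_id'))
toy_votes = pd.DataFrame({'movie_id': [1, 1, 2, 3],
                          'stars': [5, 4, 3, 1]})

# Inner join (the default): movie 3 has no title, so its vote is dropped
joined = pd.merge(toy_movies, toy_votes, left_index=True, right_on='movie_id')
print(joined)
```

Passing `how='left'` or `how='outer'` to `pd.merge` would keep unmatched rows instead, filling the missing side with `NaN`.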

#### Top 10 most voted movies

In [None]:
top_10 = (all_together.groupby(['movie_id', 'title']).movie_id
                      .count()
                      .sort_values(ascending=False)
                      .head(10))
display(top_10)

#### Number of movies released by year

In [None]:
by_year = movies.groupby(movies.year_of_release).size().sort_index()
display(by_year.head())

**This doesn't say much...**

### One image is worth a thousand numbers...

**Matplotlib**

* Most used package for 2D plots in Python
* Syntax similar to MATLAB
* Graphs can be included in the notebook
* Integration with Pandas

Besides Matplotlib there are others, like **bokeh**, **seaborn** and many more...

#### Number of movies released by year

In [None]:
%matplotlib inline
import matplotlib
# recent seaborn versions no longer restyle plots on import; apply the style explicitly
import seaborn as sns
sns.set()
# default size for graphs
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 4

by_year.plot();

#### Movies and votes

In [None]:
stars = pd.crosstab([all_together.movie_id, all_together.title], 
                    all_together.stars)
stars.columns = list(map(str, stars))
stars.head()

#### Expressed as proportions... (normalized)

In [None]:
normalized = stars.div(stars.sum(axis=1), axis=0)
normalized.head()
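
The two steps above (count votes per star, then divide each row by its total) can be checked by hand on a toy table. The votes below are made up for this sketch:

```python
import pandas as pd

# Toy votes (made up): two movies, stars between 1 and 3
votes = pd.DataFrame({'movie_id': [1, 1, 1, 2, 2],
                      'stars':    [3, 3, 1, 2, 2]})

counts = pd.crosstab(votes.movie_id, votes.stars)   # rows: movies, columns: stars
shares = counts.div(counts.sum(axis=1), axis=0)     # divide each row by its total

print(counts)
print(shares.sum(axis=1))   # every row now adds up to 1
```

`axis=0` in `div` aligns the row totals with the row index, so each movie is divided by its own number of votes.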

### How can we separate the good from the bad?

### sklearn (Machine learning in Python)

* Has almost everything related to machine learning in Python
 * Preprocessing
 * Classification and Regression
 * Clustering
 * Dimensionality reduction
 * Metrics

* Built on NumPy, SciPy and matplotlib

### How can we separate the good from the bad?


In [None]:
from sklearn.cluster import KMeans
model = KMeans(3, random_state=1) # 3 groups; random seed fixed for this example
model.fit(normalized)
labels = model.predict(normalized)
display(labels[:5])
display(normalized[:5])
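
To get a feel for what KMeans returns, here is a minimal sketch on synthetic 2-D points (two made-up, well-separated blobs, unrelated to the movie data). It shows that points from the same blob get the same label, while the label numbers themselves are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of made-up 2-D points
rng = np.random.default_rng(0)
low = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
high = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([low, high])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_labels = toy_model.fit_predict(points)

# Points from the same blob share a label; which blob gets 0 or 1 is arbitrary
print(toy_labels)
print(toy_model.cluster_centers_)
```

Because the numbering is arbitrary, any mapping from label numbers to names such as "Bad" or "Good" has to be chosen by looking at the cluster centers.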

In [None]:
film_ratings = ['Bad', 'Good', 'Neutral']
normalized['group'] = 'Bad'
normalized.loc[labels==1, 'group'] = 'Good'
normalized.loc[labels==2, 'group'] = 'Neutral'
normalized.head()

#### How can we show something that represents the groups?

In [None]:
import radar  # auxiliary functionality to make the graph

df = pd.DataFrame(model.cluster_centers_)
df.columns=[str(x+1) for x in range(5)]
radar.plot_radar(df, df.columns, legends=[film_ratings[x] for x in df.index]) 

### What if we want to show not only the centers?

In [None]:
d = normalized[:500].copy()
d['color'] = 'g' 
d.loc[d.group=='Bad', 'color']  = 'r'
d.loc[d.group=='Neutral', 'color'] = 'b'
radar.plot_radar(d, titles=[str(x+1) for x in range(5)], colors=d.color, normalize=True, fill=False)

**Doesn't make much sense...** 

#### What happens if we keep only the rating with the most votes?

In [None]:
normalized.head()

In [None]:
s = normalized.loc[:, '1':'5'].idxmax(axis=1)
s.head()

In [None]:
v = normalized.loc[:, '1':'5'].max(axis=1)
v.head()
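
A toy illustration of the two calls above, with made-up shares for two hypothetical movies: `idxmax(axis=1)` returns the column label where each row peaks, and `max(axis=1)` returns the peak value itself.

```python
import pandas as pd

# Made-up shares of votes per star for two hypothetical movies
toy_shares = pd.DataFrame({'1': [0.7, 0.1], '2': [0.2, 0.3], '3': [0.1, 0.6]},
                          index=['movie_a', 'movie_b'])

top_star = toy_shares.idxmax(axis=1)   # column label holding the row maximum
top_share = toy_shares.max(axis=1)     # the maximum value itself

print(top_star.tolist())
print(top_share.tolist())
```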

In [None]:
x = (s.values.astype(np.int16) +
     np.random.normal(scale=0.15, size=len(normalized)))
df = pd.DataFrame({'star': x, 
                   'value': v.values,
                   'group': normalized['group']})

ax = df[df.group=='Bad'].plot.scatter(x='star', y='value', color='r', label='Bad', figsize=(12, 6), alpha=0.6)
df[df.group=='Neutral'].plot.scatter(x='star', y='value', color='b', label='Neutral', ax=ax, alpha=0.6)
df[df.group=='Good'].plot.scatter(x='star', y='value', color='g', label='Good', ax=ax, alpha=0.6);

### Conclusions

* Python is a language well suited to doing science
* Jupyter notebook, Pandas and Matplotlib are great tools for exploratory data analysis
* If you want to do machine learning, take a look at sklearn

### Thanks! Questions?


<p><img src="../common/imgs/gmail-1162901_960_720.png" width="20" style="float: left;" align="middle"> arielrossanigo@gmail.com</p>

<p><img src="../common/imgs/twitter-312464_960_720.png" width="20" style="float: left;" align="middle"> @arielrossanigo</p>

<p><img src="../common/imgs/github-154769__340.png" width="20" style="float: left;" align="middle"> https://github.com/arielrossanigo</p>

