<small><i>March 2018 - This notebook was created by [Santi Seguí](www.cvc.uab.es/people/ssegui/). Source and license info are in the folder.</i></small>

<h1>Data Visualization with IPython</h1>
<h4>Matplotlib</h4>

The easiest way to interact with matplotlib is via pylab in IPython. Starting IPython (or the IPython notebook) in "pylab mode" pre-loads both matplotlib and numpy into the session:

ipython notebook --pylab

You can specify a custom graphical backend (e.g. qt, gtk, osx), but IPython generally does a good job of auto-selecting one. Matplotlib is then ready to go, and you can access its API via plt. If you do not start IPython in pylab mode, you can set this up manually with the following convention:

import matplotlib.pyplot as plt

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import randn
from scipy import stats
import pandas as pd
import seaborn as sns # pip install seaborn
sns.set_palette("deep", desat=.6)
plt.rcParams["figure.figsize"] = (8, 4)  # set_context only applies context keys, so set the figure size directly
sns.set_style("whitegrid")

<b>Data to play with</b>

In [None]:
titanic = sns.load_dataset("titanic")

In [None]:
# Let's see what the Titanic data looks like
titanic.head()

Some questions we would like to answer visually:
* What is the age distribution?
* How does the age distribution differ between those who died and those who survived?
* How many passengers died or survived in each class?
* Was the chance of survival equal in all classes?

<h5><b>Data Distribution</b> <br></h5>
1 Single Variable : <b>Column Histogram</b>

In [None]:
data=pd.Series(titanic.age.values).dropna()
plt.hist(data);
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("# persons")
plt.show()

By default, histograms are plotted with 10 bins of equal width. The more bins we use, the more sensitive the plot is to high-frequency patterns in the distribution, including sampling noise.
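
To make the effect of the bin count concrete, here is a minimal sketch using `np.histogram`, which computes the same kind of counts that `plt.hist` draws. The ages are synthetic (drawn from a normal distribution as a stand-in for the Titanic column), so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the age column (an assumption, not the Titanic data).
rng = np.random.default_rng(0)
ages = rng.normal(loc=30, scale=14, size=700).clip(0, 80)

# Same data, two different bin counts over the same range.
counts10, edges10 = np.histogram(ages, bins=10, range=(0, 80))
counts20, edges20 = np.histogram(ages, bins=20, range=(0, 80))

print("10 bins, width", edges10[1] - edges10[0], "->", counts10)
print("20 bins, width", edges20[1] - edges20[0], "->", counts20)
```

Both histograms contain the same 700 observations; the 20-bin version simply splits each 10-bin column in two, which exposes finer structure but also more noise.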

In [None]:
data=pd.Series(titanic.age.values).dropna()
plt.hist(data, bins = 20);
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("# persons")
plt.show()

In [None]:
plt.hist(data, 20, color=sns.desaturate("indianred", .75), histtype='barstacked');
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("# persons")
plt.show()

1 Single Variable : <b>Boxplot</b>
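
As a rough sketch of what a box plot encodes, the snippet below computes the quartiles and the whisker limits at 1.5 times the interquartile range (matplotlib's default `whis` setting). The ages are synthetic and only stand in for the Titanic column.

```python
import numpy as np

# Synthetic data for illustration (an assumption, not the Titanic ages).
rng = np.random.default_rng(1)
ages = rng.normal(loc=30, scale=14, size=500).clip(0, 80)

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box.
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
whisker_low = ages[ages >= lower_limit].min()
whisker_high = ages[ages <= upper_limit].max()

# Anything beyond the whisker limits is drawn as an individual outlier point.
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(f"box: {q1:.1f} to {q3:.1f}, median {median:.1f}")
print(f"whiskers: {whisker_low:.1f} to {whisker_high:.1f}, {outliers.size} outliers")
```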

In [None]:
plt.boxplot(data, notch=True)  # notch=True draws a notched box around the median
plt.title("Age distribution")
plt.ylabel("Age")
plt.show()

1 Single Variable / 2 Distributions : <b>Stacked Column Histogram</b>

In [None]:
data1=pd.Series(titanic[titanic.alive=="yes"].age.values).dropna()
plt.hist(data1, bins=20, color="#6495ED",histtype='stepfilled', label='Survived');
data2=pd.Series(titanic[titanic.alive=="no"].age.values).dropna()
plt.hist(data2, bins=20, color="#F08084",histtype='stepfilled',label='Died');

plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("# persons")

plt.legend(loc='best',mode="expand", borderaxespad=0.)
plt.show()

The <b>alpha argument</b> makes the bars semi-transparent, so both distributions stay visible where they overlap.

In [None]:
#Perhaps some transparency will help
data1=pd.Series(titanic[titanic.alive=="yes"].age.values).dropna()
plt.hist(data1, bins=20, color="#6495ED",histtype='stepfilled', alpha=0.5, label='Survived');
data2=pd.Series(titanic[titanic.alive=="no"].age.values).dropna()
plt.hist(data2, bins=20, color="#F08084",histtype='stepfilled', alpha=0.5, label='Died');
plt.legend(loc='best',mode="expand", borderaxespad=0.)

plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("# persons")

plt.show()

In [None]:
#And what about normalizing each group separately?
# density=True replaces the normed argument, which was removed from matplotlib
data1=pd.Series(titanic[titanic.alive=="yes"].age.values).dropna()
plt.hist(data1, bins=20, color="#6495ED",histtype='stepfilled',alpha=0.5,density=True);
data2=pd.Series(titanic[titanic.alive=="no"].age.values).dropna()
plt.hist(data2, bins=20, color="#F08084",histtype='stepfilled',alpha=0.5,density=True);
plt.title("Age distribution")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()

Let's check how it looks with another dataset

In [None]:
data1 = stats.poisson(2).rvs(90)
data2 = stats.poisson(5).rvs(400)

In [None]:
max_data = np.r_[data1, data2].max()
bins = np.linspace(0, max_data, max_data + 1)
plt.hist(data1, bins, color="#6495ED",histtype='stepfilled');
plt.hist(data2, bins, color="#F08084",histtype='stepfilled');

The <b>density argument</b> (called normed in older matplotlib releases) is useful when you want to compare two distributions that do not have the same number of observations. Note also that bins can be a sequence of bin edges instead of a bin count.
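
A small sketch of why normalization helps when sample sizes differ, using `np.histogram` with `density=True` (the option current NumPy and matplotlib releases use for this normalization). The two samples mimic the Poisson data above but are regenerated with NumPy so the block is self-contained; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
small = rng.poisson(2, size=90)    # few observations
large = rng.poisson(5, size=400)   # many observations

# Explicit bin edges: one unit-wide bin per integer value.
edges = np.arange(0, max(small.max(), large.max()) + 2)

raw_small, _ = np.histogram(small, bins=edges)
raw_large, _ = np.histogram(large, bins=edges)
dens_small, _ = np.histogram(small, bins=edges, density=True)
dens_large, _ = np.histogram(large, bins=edges, density=True)

widths = np.diff(edges)
# Raw counts add up to the sample sizes, so the larger sample dominates a plot.
print(raw_small.sum(), raw_large.sum())
# Normalized histograms each have total area 1, so their shapes are comparable.
print((dens_small * widths).sum(), (dens_large * widths).sum())
```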

In [None]:
max_data = np.r_[data1, data2].max()
bins = np.linspace(0, max_data, max_data + 1)
plt.hist(data1, bins, density=True, color="#6495ED",histtype='stepfilled');
plt.hist(data2, bins, density=True, color="#F08084",histtype='stepfilled');

In [None]:
max_data = np.r_[data1, data2].max()
bins = np.linspace(0, max_data, max_data + 1)
plt.hist(data1, bins, density=True, color="#6495ED",alpha=0.5,histtype='stepfilled');
plt.hist(data2, bins, density=True, color="#F08084",alpha=0.5,histtype='stepfilled');


In [None]:
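# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sns.histplot(data1, bins=bins, stat="density", kde=True, color="#6495ED");
sns.histplot(data2, bins=bins, stat="density", kde=True, color="#F08084");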

https://stanford.edu/~mwaskom/software/seaborn/tutorial/distributions.html

<h5><b>Data Comparison</b> <br></h5>
<b>BAR CHARTS</b> 


In [None]:
t = titanic.groupby(['pclass']).size()

plt.bar(t.index,t.values,align='center',color=sns.color_palette("Set2", 3))
plt.xticks([1,2,3], ['1st Class', '2nd Class', '3rd Class'], rotation='horizontal')

plt.title('Passengers per class');
plt.ylabel('Number of Passengers');
plt.xlabel('');

How can we see the number of deaths and survivors per class?
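
One way to answer this is to count rows per (class, outcome) pair and then move the outcome level into columns. A minimal sketch on a hypothetical six-row frame (the values are made up; only the column names `pclass` and `survived` follow the Titanic dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real passenger list.
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Long form: one count per (pclass, survived) pair.
counts = toy.groupby(["pclass", "survived"]).size()

# Wide form: one row per class, one column per outcome (0 = died, 1 = survived).
table = counts.unstack(fill_value=0)
print(table)
```

The wide table has exactly the shape a stacked bar chart needs: one column of heights per outcome, indexed by class.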

In [None]:
print(titanic.groupby(['pclass', 'survived']).size())
t = titanic.groupby(['pclass', 'survived']).size().unstack()
print(t)
red, blue = '#F08084', '#6495ED'

plt.bar([1,2,3], t[0], color=red, label='Died' ,align='center')
plt.bar([1,2,3], t[1], bottom=t[0], color=blue, label='Survived',align='center')
plt.xticks([1,2,3], ['1st Class', '2nd Class', '3rd Class'], rotation='horizontal')

plt.ylabel("Number")
plt.xlabel("")
plt.title("Passengers per class")
plt.legend(loc='upper left')
plt.show()

<h5><b>Data Composition</b> <br></h5>
<b>Stacked 100% Column Chart</b>
<br>To see whether the chance of survival depends on the class:
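
The row normalization behind a 100% stacked chart can be sketched as follows. The counts are hypothetical, and `DataFrame.div` with `axis=0` is used here as an alternative to the transpose trick in the next cell.

```python
import pandas as pd

# Hypothetical counts: rows are classes, columns are outcomes (0 = died, 1 = survived).
counts = pd.DataFrame({0: [20, 30, 90], 1: [60, 30, 30]}, index=[1, 2, 3])

# Divide each row by its own total so that every row sums to 1.
fractions = counts.div(counts.sum(axis=1), axis=0)
print(fractions)
```

After this step the bar heights are fractions per class, so classes with very different passenger counts can be compared directly.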

In [None]:
#normalize each row by transposing, normalizing each column, and un-transposing
t = (1. * t.T / t.T.sum()).T

plt.bar([1,2,3], t[0], color=red, label='Died',align='center')
plt.bar([1,2,3], t[1], bottom=t[0], color=blue, label='Survived',align='center')
plt.xticks([1,2,3], ['1st Class', '2nd Class', '3rd Class'], rotation='horizontal')
plt.ylabel("Fraction")
plt.xlabel("")
plt.title("Proportion of survivors per class")
plt.legend(loc="upper left", bbox_to_anchor=(1,1))

plt.show()

<b>PIE CHARTS</b> 

In [None]:
t = titanic.groupby(['pclass']).size()

plt.subplot(121)
plt.pie(t, labels=['1st Class', '2nd Class', '3rd Class'])
plt.title("Passenger Class on the Titanic");
plt.subplot(122,aspect=True)
plt.pie(t, labels=['1st Class', '2nd Class', '3rd Class'], colors=sns.color_palette("Set2", 3),autopct='%i%%',startangle=90)
plt.title("Passenger Class on the Titanic");
plt.axis('equal')
plt.tight_layout()
plt.show()

## Understanding flight traffic
### 1) Passengers per year

In [None]:
flights = sns.load_dataset("flights")
flights.head()

In [None]:
# Total passengers per year (the dataset records passengers, not flights).
# Selecting the column first avoids summing the categorical month column.
t = flights.groupby('year')['passengers'].sum()

plt.bar(t.index, t.values, align='center')
for x, y in zip(t.index, t.values):
    plt.text(x, y + 60, '%d' % y, ha='center', va='bottom');
    
plt.xlim(1948.5,1960.5)
plt.xticks( np.arange(1949, 1961, 1))
plt.yticks([])
plt.title('Passengers per year');

plt.ylabel('Number of passengers');
plt.xlabel('Year');

### 2) Passengers per month using tables
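
Reshaping from long format (one row per month and year) to a month-by-year table is a pivot. A minimal sketch with made-up numbers; the column names mirror the flights dataset, everything else is illustrative.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, month) observation.
long_df = pd.DataFrame({
    "year":       [1949, 1949, 1950, 1950],
    "month":      ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 12, 11, 15],
})

# Rows become months, columns become years, cells hold the passenger counts.
wide = long_df.pivot(index="month", columns="year", values="passengers")

# Plain string months sort alphabetically, so restore calendar order explicitly.
wide = wide.loc[["Jan", "Feb"]]
print(wide)
```

Passing `index`, `columns` and `values` by keyword keeps the call readable and works across pandas versions.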

In [None]:
flights_rect = flights.pivot(index="month", columns="year", values="passengers")
flights_rect = flights_rect.loc[flights.month.iloc[:12]]  # calendar order; .loc replaces the removed .ix indexer
flights_rect.head()

### 3) Passengers per month using heatmaps

In [None]:
sns.heatmap(flights_rect);

In [None]:
sns.heatmap(flights_rect, annot=True, fmt="d");

<h5>Relationship of Two Variables</h5>
<b>SCATTER PLOT</b>


In [None]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
plt.scatter(x, y, c=sns.color_palette("pastel", N), alpha=0.9)
plt.show()

<h5>Relationship of Three Variables</h5>
<b>Bubble Chart</b>

In [None]:

area = np.pi * (20 * np.random.rand(N))**2 # marker areas for radii from 0 to 20 points

plt.scatter(x, y, s=area, c=sns.color_palette("pastel", N), alpha=0.6)
plt.show()

## The presence of women in US academia

In [None]:
from pandas import read_csv  
sns.set_style("white")
# Read the data into a pandas DataFrame.  
gender_degree_data = read_csv("http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv")  

In [None]:
gender_degree_data.head()

In [None]:
# How should we display it? Try to create a visualization

<b>A cool visualization using simple LINE PLOTS</b>

In [None]:
# These are the "Tableau 20" colors as RGB.  
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),  
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),  
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),  
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),  
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
#cm=sns.palplot(sns.color_palette("Set2", 20))
  
# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.  
for i in range(len(tableau20)):  
    r, g, b = tableau20[i]  
    tableau20[i] = (r / 255., g / 255., b / 255.)  
    
#You typically want your plot to be ~1.33x wider than tall. This plot is a rare  
# exception because of the number of lines being plotted on it.  
# Common sizes: (10, 7.5) and (12, 9)  
plt.figure(figsize=(12, 14))  
  
# Remove the plot frame lines. They are unnecessary chartjunk.  
ax = plt.subplot(111)  
ax.spines["top"].set_visible(False)  
ax.spines["bottom"].set_visible(False)  
ax.spines["right"].set_visible(False)  
ax.spines["left"].set_visible(False)  
  
# Ensure that the axis ticks only show up on the bottom and left of the plot.  
# Ticks on the right and top of the plot are generally unnecessary chartjunk.  
ax.get_xaxis().tick_bottom()  
ax.get_yaxis().tick_left()  
  
# Limit the range of the plot to only where the data is.  
# Avoid unnecessary whitespace.  
plt.ylim(0, 90)  
plt.xlim(1968, 2014)  
  
# Make sure your axis ticks are large enough to be easily read.  
# You don't want your viewers squinting to read your plot.  
plt.yticks(range(0, 91, 10), [str(x) + "%" for x in range(0, 91, 10)], fontsize=14)  
plt.xticks(fontsize=14)  

# Provide tick lines across the plot to help your viewers trace along  
# the axis ticks. Make sure that the lines are light and small so they  
# don't obscure the primary data lines.  
for y in range(10, 91, 10):  
    plt.plot(range(1968, 2012), [y] * len(range(1968, 2012)), "--", lw=0.5, color="black", alpha=0.3)  
  
# Remove the tick marks; they are unnecessary with the tick lines we just plotted.  
# Booleans are used because current matplotlib no longer accepts "on"/"off" strings.  
plt.tick_params(axis="both", which="both", bottom=False, top=False,  
                labelbottom=True, left=False, right=False, labelleft=True)  
  
# Now that the plot is prepared, it's time to actually plot the data!  
# Note that I plotted the majors in order of the highest % in the final year.  
majors = ['Health Professions', 'Public Administration', 'Education', 'Psychology',  
          'Foreign Languages', 'English', 'Communications\nand Journalism',  
          'Art and Performance', 'Biology', 'Agriculture',  
          'Social Sciences and History', 'Business', 'Math and Statistics',  
          'Architecture', 'Physical Sciences', 'Computer Science',  
          'Engineering']  
  
for rank, column in enumerate(majors):  
    # Plot each line separately with its own color, using the Tableau 20  
    # color set in order.  
    plt.plot(gender_degree_data.Year.values,  
            gender_degree_data[column.replace("\n", " ")].values,  
            lw=2.5, color=tableau20[rank]);
      
    # Add a text label to the right end of every line. Most of the code below  
    # is adding specific offsets y position because some labels overlapped.  
    y_pos = gender_degree_data[column.replace("\n", " ")].values[-1] - 0.5  
    if column == "Foreign Languages":  
        y_pos += 0.5  
    elif column == "English":  
        y_pos -= 0.5  
    elif column == "Communications\nand Journalism":  
        y_pos += 0.75  
    elif column == "Art and Performance":  
        y_pos -= 0.25  
    elif column == "Agriculture":  
        y_pos += 1.25  
    elif column == "Social Sciences and History":  
        y_pos += 0.25  
    elif column == "Business":  
        y_pos -= 0.75  
    elif column == "Math and Statistics":  
        y_pos += 0.75  
    elif column == "Architecture":  
        y_pos -= 0.75  
    elif column == "Computer Science":  
        y_pos += 0.75  
    elif column == "Engineering":  
        y_pos -= 0.25  
      
    # Again, make sure that all labels are large enough to be easily read  
    # by the viewer.  
    #text(2011.5, y_pos, column, fontsize=14)  
    plt.text(2011.5, y_pos, column, fontsize=14, color=tableau20[rank])
      
# matplotlib's title() call centers the title on the plot, but not the graph,  
# so I used the text() call to customize where the title goes.  
  
# Make the title big enough so it spans the entire plot, but don't make it  
# so big that it requires two lines to show.  
  
# Note that if the title is descriptive enough, it is unnecessary to include  
# axis labels; they are self-evident, in this plot's case.  
plt.text(1995, 93, "Percentage of Bachelor's degrees conferred to women in the U.S.A."  
       ", by major (1970-2012)", fontsize=17, ha="center");
  
# Always include your data source(s) and copyright notice! And for your  
# data sources, tell your viewers exactly where the data came from,  
# preferably with a direct link to the data. Just telling your viewers  
# that you used data from the "U.S. Census Bureau" is completely useless:  
# the U.S. Census Bureau provides all kinds of data, so how are your  
# viewers supposed to know which data set you used?  
plt.text(1966, -8, "Data source: nces.ed.gov/programs/digest/2013menu_tables.asp"  
       "\nAuthor: Randy Olson (randalolson.com / @randal_olson)"  
       "\nNote: Some majors are missing because the historical data "  
       "is not available for them", fontsize=10)

# sample from http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/

## Compare features

If we want to plot the pairwise relationships between the features of a dataset, we can use the <b>pair plot</b> included in the seaborn package.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
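
With four measured features, a pair plot is a 4 by 4 grid: the diagonal shows the distribution of each feature, and every off-diagonal panel is a scatter plot of one feature against another. The sketch below only enumerates that layout; the column names are assumed to follow the four iris measurements.

```python
from itertools import product

# Assumed column names for the four iris measurements.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Each grid cell pairs a row feature (y axis) with a column feature (x axis).
panels = [
    (row, col, "distribution" if row == col else "scatter")
    for row, col in product(features, repeat=2)
]

n_diag = sum(kind == "distribution" for _, _, kind in panels)
print(len(panels), "panels:", n_diag, "distributions,", len(panels) - n_diag, "scatter plots")
```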

In [None]:
iris    = sns.load_dataset("iris")
iris.head()

In [None]:
sns.pairplot(iris, height=2.5);  # height replaces the old size argument

In [None]:
sns.pairplot(iris, hue="species", height=2.5);  # height replaces the old size argument