# Confidence Interval Plots in Python
> A tutorial on creating confidence interval plots in Python.

- toc: false 
- badges: true
- comments: true
- categories: [altair, python]
- image: images/chart-preview.png

# About

This blog post details how to create confidence interval plots in Python using the Altair visualization package. Altair is a declarative statistical visualization library based on Vega and Vega-Lite. It is one of my favourite visualization packages in Python. More details can be found [here](https://altair-viz.github.io/getting_started/overview.html).

Let's load the packages and get the data from the cars dataset.

In [26]:
import altair as alt
import numpy as np
import pandas as pd
from vega_datasets import data

source = data.cars()

source.head()

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |


### Create a plot showing how miles per gallon changes by year
Altair has built-in capabilities to create this visualization:
1. Create a base line chart showing the average miles per gallon per year.
2. Create a confidence interval band chart using `mark_errorband()`.
3. Layer the line and CI band charts to create the final visualization.

In [27]:
line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon')))

(band + line).properties(title='Confidence Interval Plot of miles per gallon')
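It helps to know roughly what `extent='ci'` computes. As far as I understand from the Vega-Lite documentation, the band is a bootstrapped 95% confidence interval of the mean within each x group. The sketch below imitates that idea with the standard library only. The function name `bootstrap_ci_mean`, the number of resamples and the seed are my own illustrative choices, not Altair's internals, so the numbers will not match the chart exactly.

```python
import random
import statistics

def bootstrap_ci_mean(values, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not part of Altair or Vega-Lite.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    alpha = (1 - level) / 2
    lo_idx = int(alpha * n_boot)            # index of the lower percentile
    hi_idx = int((1 - alpha) * n_boot) - 1  # index of the upper percentile
    return means[lo_idx], means[hi_idx]

# Miles_per_Gallon values of the five 1970 rows shown by source.head()
mpg_sample = [18.0, 15.0, 18.0, 16.0, 17.0]
lower, upper = bootstrap_ci_mean(mpg_sample)
print(lower, upper)
```

Because the interval comes from resampling, it changes slightly with the seed and the number of resamples, which is also why a bootstrapped band can differ a little from one rendering to the next.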

Suppose you want to understand how mileage varies by origin. This can be done by simply encoding color in the plot.

In [28]:
line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)',
                color='Origin'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
                color='Origin'))

(band + line).properties(title='Confidence Interval of miles per gallon by origin')

### Create a confidence interval plot from grouped data

In most real-world situations you have a large dataset and still need to plot confidence intervals. In this scenario it is better to precompute the confidence interval from the mean and the margin of error. Let's create a pandas DataFrame with the required fields as shown below:

In [29]:
df=(source
 .groupby(['Year'])
 .agg(avg_mpg=('Miles_per_Gallon','mean'),
     std_mpg=('Miles_per_Gallon','std'),
     n=('Miles_per_Gallon','count'))
 .assign(ul=lambda x:x['avg_mpg']+1.96*x['std_mpg']/np.sqrt(x['n']),
        ll=lambda x:x['avg_mpg']-1.96*x['std_mpg']/np.sqrt(x['n']))
 .reset_index()
)

df.head()

| | Year | avg_mpg | std_mpg | n | ul | ll |
|---|---|---|---|---|---|---|
| 0 | 1970-01-01 | 17.689655 | 5.339231 | 29 | 19.632937 | 15.746373 |
| 1 | 1971-01-01 | 21.25 | 6.591942 | 28 | 23.69169 | 18.80831 |
| 2 | 1972-01-01 | 18.714286 | 5.435529 | 28 | 20.727634 | 16.700938 |
| 3 | 1973-01-01 | 17.1 | 4.700245 | 40 | 18.556621 | 15.643379 |
| 4 | 1974-01-01 | 22.703704 | 6.42001 | 27 | 25.125345 | 20.282062 |
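The 1.96 in the code above is the normal-approximation critical value for a 95% interval, so each limit is the mean plus or minus 1.96 times the standard error (`std / sqrt(n)`). Here is a small sketch that reproduces the 1970 row by hand and shows where the multiplier comes from. The helper `normal_ci` and its `level` parameter are hypothetical names introduced for illustration.

```python
import math
from statistics import NormalDist

def normal_ci(mean, std, n, level=0.95):
    """Normal-approximation confidence interval for a mean.

    Hypothetical helper for illustration only.
    """
    # Two-sided critical value; about 1.96 when level is 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * std / math.sqrt(n)  # margin of error = z * standard error
    return mean - margin, mean + margin

# Values copied from the 1970 row of the table above.
ll, ul = normal_ci(mean=17.689655, std=5.339231, n=29)
print(round(ll, 2), round(ul, 2))
```

The result agrees with the `ll` and `ul` columns to about four decimal places; the small difference comes from using the exact critical value instead of the rounded 1.96. With fewer than 30 or so observations per group, a t-distribution critical value would give a slightly wider interval, so treat the normal approximation as a convenience.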


The few lines of code below create the custom confidence interval plot:

In [30]:
line = (alt
        .Chart(df).mark_line(color='blue')
        .encode(x='Year',
                y='avg_mpg'))

band = (alt
        .Chart(df)
        .mark_area(opacity=0.5,color='blue')
        .encode(x='Year',
                y=alt.Y('ll', title='Miles/Gallon'),
                y2=alt.Y2('ul', title='Miles/Gallon')))

(band + line).properties(title='Confidence Interval of miles per gallon by year (Custom)')

### Conclusion

The confidence interval plot is one of the most important tools in a data scientist's toolkit for understanding the uncertainty of a metric. Altair provides excellent visualization capabilities to make this plot in a few lines of Python code.

In [33]:
import numpy as np
import pandas as pd
import altair as alt

# Generate some random data
rng = np.random.RandomState(1)
x = rng.rand(40) ** 2
y = 10 - 1.0 / (x + 0.1) + rng.randn(40)
source = pd.DataFrame({"x": x, "y": y})

# Define the degree of the polynomial fits
degree_list = [1, 3, 5]

base = alt.Chart(source).mark_circle(color="black").encode(
        alt.X("x"), alt.Y("y")
)

polynomial_fit = [
    base.transform_regression(
        "x", "y", method="poly", order=order, as_=["x", str(order)]
    )
    .mark_line()
    .transform_fold([str(order)], as_=["degree", "y"])
    .encode(alt.Color("degree:N"))
    for order in degree_list
]

alt.layer(base, *polynomial_fit)

In [36]:
polynomial_fit

[alt.Chart(...), alt.Chart(...), alt.Chart(...)]

In [31]:
from plotnine import *
from mizani.formatters import percent_format,custom_format

def get_decile(arr):
    # Deciles of arr (0th to 100th percentile in steps of 10), rounded to 2 decimals
    perc=np.array([0,10,20,30,40,50,60,70,80,90,100])
    out=[round(np.percentile(arr,i),2) for i in perc]
    return out

def percentile_plot(df_data,metrics,nice_name):
    # One row per variant: the 10%..90% deciles plus the mean of `metrics`
    percentiles=df_data \
    .dropna(subset=[metrics])\
    .groupby(["variant"]).agg({metrics:get_decile})[metrics].apply(pd.Series)\
    .rename(columns= lambda x: str(x*10)+"%" ).reset_index()\
    .drop(columns=["0%","100%"])\
    .merge(df_data.dropna(subset=[metrics]).groupby(["variant"]).agg({metrics:"mean"}).reset_index()
           ,on='variant')\
    .rename(columns={metrics:"average"})

    new_percentile=percentiles.set_index("variant")\
    .unstack().reset_index()\
    .rename(columns={"level_0":'Percentile',0:nice_name})
    
    #print(percentiles.head())

    dollar_formatter = custom_format('${:.2f}')

    return (ggplot(data=new_percentile)
    + aes(x='variant',y=nice_name,color='Percentile', group='Percentile')
    + geom_point() 
    + geom_line()
    + xlab("variant")
    + geom_hline(yintercept=0,linetype='--',color='r')
    + ggtitle("Transactional {} distribution by offer type".format(nice_name))
    + scale_y_continuous(labels=dollar_formatter)
    )

#percentile_plot(df_data,metrics='ogp_usd',nice_name='oGP$')
