# Some advanced things that you can do in Python. 

First import packages you'll be using. 

For this portion, I'm going to feature [statsmodels](https://www.statsmodels.org/stable/index.html), [Altair](https://altair-viz.github.io), and [Geopandas](https://geopandas.org).
- Statsmodels is a package that can do R-like formula statistics. I'll just be showing you an OLS linear regression. 
- Altair is still a work-in-progress, but it has some great features, as you will see. 
- Geopandas will allow you to easily plot a dataframe that has geospatial data.

In [None]:
import pandas as pd
import numpy as np
import altair as alt

import statsmodels.formula.api as smf
import re

Load our processed data.

In [None]:
df = pd.read_csv('../data/processed/merged_winner_pop_gdp.csv')

Check and clean our data

In [None]:
for col in df.columns:
    print(col, df[col].dtype)

In [None]:
df.head()
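
One common check before moving on is counting missing values per column. The sketch below runs on a small made-up frame: the `toy` data and its values are hypothetical, so swap in `df` to run it on the real table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'Team': ['A', 'B', 'C'],
    'GDP': [1.0e9, np.nan, 3.0e9],
    'Total Medals': [10, 0, 5],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Rows with at least one missing value, for a closer look.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```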

### Add some new variables. 

In [None]:
df['GDP (in log10 billions)'] = np.log10(df['GDP']/1000000000.0)
df['Population (in log10 billions)'] = np.log10(df['Population']/1000000000.0)
df['Total Medals (log10)'] = np.log10(df['Total Medals'])
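
One thing to watch with log transforms: `np.log10(0)` evaluates to `-inf`, so any row with zero medals ends up with an infinite value. The sketch below shows this on a hypothetical `toy` frame and one possible way to handle it, marking those rows as missing. This is an assumption about what you might want, not a step from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one team with zero medals.
toy = pd.DataFrame({'Total Medals': [0, 1, 100]})

# log10(0) is -inf; silence NumPy's divide-by-zero warning for this demo.
with np.errstate(divide='ignore'):
    toy['Total Medals (log10)'] = np.log10(toy['Total Medals'])
print(toy)

# One option: treat -inf as missing so it cannot distort later steps.
toy['Total Medals (log10)'] = toy['Total Medals (log10)'].replace(-np.inf, np.nan)
print(toy)
```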

# Run some stats with statsmodels

Statsmodels formulas don't deal well with special characters and spaces, so strip them from the column names first.

In [None]:
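# Raw strings (r'...') keep Python from treating \w and \s as string escapes.
# First drop anything that is not a word character or whitespace...
df_smf = df.rename(columns=lambda x: re.sub(r'[^\w\s]', '', x))
# ...then drop the whitespace itself.
df_smf = df_smf.rename(columns=lambda x: re.sub(r'\s', '', x))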
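
To see what this renaming does to the column names, here is a small sketch that applies the same two substitutions to the three columns created earlier. The `clean_name` helper is hypothetical and exists only for illustration.

```python
import re

def clean_name(name):
    """Drop punctuation, then whitespace, mirroring the two rename calls above."""
    no_punct = re.sub(r'[^\w\s]', '', name)
    return re.sub(r'\s', '', no_punct)

columns = [
    'GDP (in log10 billions)',
    'Population (in log10 billions)',
    'Total Medals (log10)',
]
for original in columns:
    print(original, '->', clean_name(original))
```

The cleaned names are the ones to use in the model formula.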


In [None]:
df_smf.head()

In [None]:
model = smf.ols('TotalMedalslog10 ~ GDPinlog10billions * Populationinlog10billions', data=df_smf)
results = model.fit()
results.summary()

#### The bad news is that, unlike R, the plotting and statistical packages don't talk to each other.

This means that in order to plot this interaction, you will have to create the variables you want to plot, e.g. residualized variables and whatnot.
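
As one illustration of creating such a variable by hand, the sketch below residualizes one variable on another with plain NumPy least squares. The data are synthetic and the `residualize` helper is hypothetical. With a fitted statsmodels model, `results.resid` and `results.predict()` give you residuals and predictions directly.

```python
import numpy as np

def residualize(y, x):
    """Return the part of y not linearly explained by x (intercept included)."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix: intercept + x
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Synthetic stand-ins for two correlated predictors.
rng = np.random.default_rng(0)
population = rng.normal(size=200)
gdp = 0.8 * population + rng.normal(scale=0.5, size=200)

# GDP with the linear effect of population removed.
gdp_resid = residualize(gdp, population)
print(np.corrcoef(gdp_resid, population)[0, 1])  # close to zero by construction
```

The residualized column can then be added to the dataframe and plotted like any other variable.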

***


# Make some plots with Altair.

Make a basic plot. 

Each Altair plot is a chart that has _data_, some kind of _mark_, and _encodes_ some variables on its axes. 

In [None]:
alt.Chart(df[df['Year']==1992]).mark_circle(
        color='red',
        size=100,
        opacity=0.3
    ).encode(
        x='Total Medals',
        y='GDP (in log10 billions)',
    )

The beauty of Altair is that it's quite easy to add interactivity. 

In [None]:
alt.Chart(df[df['Year']==1992]).mark_circle(
        color='red',
        size=100,
        opacity=0.3
    ).encode(
        x='Total Medals',
        y='GDP (in log10 billions)',
        tooltip=['Team', 'Year', 'Gold', 'Silver', 'Bronze'] # so that we get a mouseover. 
    )

Altair provides some interactivity that gets _pretty fancy._

Here, I'm adding a second plot and a selector element. 

In [None]:
# create a selector 
selector = alt.selection_single(empty='all', fields=['Team'])

# shared base
base = alt.Chart(df).properties(
    width=250,
    height=250
).add_selection(selector)

# plot #1, much like before. 
points = base.mark_point(filled=True, 
        opacity=0.3,
        size=200).encode(
    x='Total Medals (log10)',
    y='GDP (in log10 billions)',
    tooltip=['Team', 'Year', 'Gold', 'Silver', 'Bronze'],
    color=alt.condition(selector, 'Team:O', alt.value('lightgray'), legend=None)
    # ^^ this line adds a conditional color, which depends on the selector. 
)

# plot #2, encodes year and medals, 
# It has a transform filter that depends on the selector. 
timeseries = base.mark_line().encode(
    x='Year:O', # the :O means treat this as ordinal, rather than numerical. 
    y='Total Medals',
    color=alt.Color('Team:O', scale=alt.Scale(scheme='sinebow'), legend=None)
).transform_filter(
    selector
).add_selection(selector)

# plot the two plots next to each other!
# Jupyter only renders an Altair chart when it is returned by the cell. 
#     This means the last line in your cell needs to be the plot. 
points | timeseries 

### Save plot as html file. 

In [None]:
(points | timeseries).save('../reports/figures/Medals_vs_GDP_byYear.html', embed_options={'renderer':'svg'})



# Challenge:

    What's up with 1992? 

***

# Geopandas

This can be a pain to install, so I didn't include the imports above. If you want to get it going, you will have to run:
```
pip install geopandas
```

Depending on your Python installation, it may fail. 

In [None]:
import matplotlib.pyplot as plt
import geopandas as gpd


### Load the data

In [None]:
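# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0.
# On newer versions, download the Natural Earth data yourself and pass its path to gpd.read_file.
df_world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))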

In [None]:
df_world.head()

### Manipulate the data
I happened to notice that the map data uses "United States of America", which does not match the "United States" we merge on, so rename it.

In [None]:
df_world['name'] = df_world['name'].replace("United States of America", "United States")
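
Rather than spotting mismatched names by eye, you can let pandas list them. The sketch below uses two hypothetical toy tables (`geo` and `medals`, with invented values) and an outer merge with `indicator=True`, which adds a `_merge` column saying whether each row was found on the left, the right, or both.

```python
import pandas as pd

# Hypothetical stand-ins for the map data and the Olympic data.
geo = pd.DataFrame({'name': ['United States of America', 'France', 'Fiji']})
medals = pd.DataFrame({
    'Country': ['United States', 'France'],
    'Total Medals': [100, 40],
})

# An outer merge keeps every row from both tables; _merge records where each came from.
check = geo.merge(medals, left_on='name', right_on='Country',
                  how='outer', indicator=True)

# Anything not marked 'both' failed to match on name.
unmatched = check[check['_merge'] != 'both']
print(unmatched[['name', 'Country', '_merge']])
```

Names that appear as `left_only` or `right_only` are candidates for renaming before an inner merge.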

### Merge the geographic data with our Olympic dataframe. 

In [None]:
df_world_olympics = df_world.merge(df.groupby('Country').sum(), left_on="name", right_on="Country", how='inner')

In [None]:
# df_world_olympics.head()
# print(df_world_olympics.columns)
# print(df_world_olympics['name'].unique())

In [None]:
df_world_olympics.head()

### Make the geo-plot

In [None]:
f, ax = plt.subplots(1,1,figsize=(16,4))
df_world_olympics.plot('Total Medals (log10)', legend=True, ax=ax, 
                        legend_kwds={'label': "Total Medals (log10)"})
plt.axis('off');

plt.savefig('../reports/figures/TotalMedals_map.pdf', bbox_inches = "tight")


# Challenge:

Can you make this plot for Medals won in 1992? 

(hint: you will need to create the right dataframe)