### Applying functions in pandas

When you work with large DataFrames, looping over the elements of a list or a Series is inefficient and slow. It is more effective to use the functions that pandas provides. To this end, we will use the __apply( )__ function.
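
As a minimal sketch of the difference, the following compares an explicit Python loop with `Series.apply` on a small made-up series. The names `rates` and `to_label` and the threshold of 5.0 are illustrative assumptions and are not part of the dataset used below.

```python
import pandas as pd

# toy data, invented for illustration only
rates = pd.Series([0.5, 3.2, 20.1, 7.7])

def to_label(x):
    # hypothetical helper: flag values above an arbitrary threshold
    return 'above' if x > 5.0 else 'below'

# explicit loop: build a Python list element by element
looped = []
for value in rates:
    looped.append(to_label(value))

# apply: pandas calls the function for every element and returns a new Series
applied = rates.apply(to_label)

print(applied.tolist())
```

Both variants produce the same labels, but `apply` returns a Series that stays aligned with the original index and can be assigned directly to a new column.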

In [None]:
import pandas as pd
df = pd.read_csv('data/suicide_data.csv')

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
def suicide_rating(x):
    if x >= 16.0:
        return 'high'
    else:
        if x <= 1.0:
            return 'low'
        else:
            return 'medium'

In [None]:
df['rating'] = df['suicides/100k pop'].apply(suicide_rating)
df.head()

In [None]:
#let's rename some columns
df.rename(columns={' gdp_for_year ($) ':'gdp_year', 'gdp_per_capita ($)':'gdp_cap'}, inplace=True)
df.columns

In [None]:
#we can apply a lambda function to a column, to remove the commas
df['gdp_year'] = df['gdp_year'].apply(lambda x: str(x).replace(',', ''))
df.gdp_year

In [None]:
#let's insert a period before the last three digits and convert the content to numerical data (note: this divides each value by 1,000)
df['gdp_year'] = df['gdp_year'].apply(lambda x: x[:-3] + '.' + x[-3:])
df['gdp_year'] = pd.to_numeric(df['gdp_year'], errors='coerce')

df.gdp_year
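
The comma removal can also be written with pandas' vectorised string methods instead of `apply` with a lambda. The sketch below uses a small made-up series in place of the `gdp_year` column; the sample values are assumptions for illustration. Unlike the cell above, it converts the digits directly without inserting a decimal point.

```python
import pandas as pd

# made-up values formatted like numbers with thousands separators
gdp = pd.Series(['2,156,624,900', '63,067,077,179', 'n/a'])

# remove the commas with the .str accessor (literal match, no regex)
cleaned = gdp.str.replace(',', '', regex=False)

# convert to numbers; entries that cannot be parsed become NaN
numeric = pd.to_numeric(cleaned, errors='coerce')

print(numeric)
```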

In [None]:
dfT = df.groupby(['country'])['gdp_year'].mean()
dfT.sort_values(inplace=True)
dfT.head(150)

If we want to make use of the added categories low, medium, and high, we can group by country and rating and count the records in each category:

In [None]:
dfG = df.groupby(['country', 'rating'])['suicides/100k pop'].count()
dfG = pd.DataFrame(dfG)
dfG.head(20)

So to sum it up, we were able to create a new column describing the situation in a country regarding the suicide rate. We created a function __suicide_rating__ which returns a new value (low, medium, high) depending on thresholds which, in this case, we took from the 1st and 3rd quartile (see the description of the DataFrame).<br>
After aggregating over each country, we now have the number of records per country that fall into each of these three categories.
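
The steps summarised above can be sketched end to end on a tiny made-up frame: take the thresholds from the quartiles, label each row with `apply`, then count the rows per country and category. The column names, the values, and the helper `rate_label` are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rate':    [0.0, 1.0, 30.0, 6.0, 8.0, 12.0],
})

# 1st and 3rd quartile of the rate column serve as thresholds
q1 = toy['rate'].quantile(0.25)
q3 = toy['rate'].quantile(0.75)

def rate_label(x):
    # hypothetical helper mirroring the low / medium / high logic
    if x >= q3:
        return 'high'
    if x <= q1:
        return 'low'
    return 'medium'

toy['rating'] = toy['rate'].apply(rate_label)

# number of rows per (country, rating) pair
counts = toy.groupby(['country', 'rating'])['rate'].count()
print(counts)
```

Note that `count()` reports how many rows fall into each category; it does not add up the rates themselves.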

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#note: count() returns the number of records per country, not a sum of the rates
dfSum = df.groupby(['country'])['suicides/100k pop'].count()
dfSum

In [None]:
dfMean = df.groupby(['country'])['gdp_year'].mean()
dfMean

In [None]:
import numpy as np
import string

In [None]:
#pd.concat([s1, s2], axis=1).reset_index()
dfNew = pd.concat([dfSum, dfMean], axis=1)
dfNew

In [None]:
dfNew.reset_index('country', inplace=True)
dfNew.rename(columns={'suicides/100k pop':'x', 'gdp_year':'y', 'country':'val'}, inplace=True)
dfNew

In [None]:
#create the figure once; figsize belongs to the figure, not to the plot call
fig, ax = plt.subplots(figsize=(12, 8))

#draw the scatter plot on the axes created above
dfNew.plot('x', 'y', kind='scatter', ax=ax)

def label_point(x, y, val, ax):
    #annotate each point with the first three letters of the country name
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x'], point['y'], str(point['val'])[:3])

label_point(dfNew.x, dfNew.y, dfNew.val, ax)

Ok, so this plot needs a bit of polishing, but that is something to be done in the data cleaning / plotting section.<br>
What information can you derive from this?