# Graphs with Python. 3
## Barplots

API SEABORN - Barplot

https://seaborn.pydata.org/generated/seaborn.barplot.html

A bar chart shows the values of a numerical variable for each level of a categorical variable.

One axis holds the categories and the other holds the values.

Bar charts look similar to histograms, but a histogram only shows how a single numerical variable is distributed, without taking categories into account.
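
The contrast with a histogram can be sketched with pandas alone. The data below is made up for illustration (the `shop` and `sales` columns are hypothetical): a bar chart needs one number per category, while a histogram only counts how many values fall in each numeric bin.

```python
import pandas as pd

# Hypothetical data: one numeric variable (sales) and one categorical variable (shop)
df = pd.DataFrame({
    'shop': ['A', 'A', 'B', 'B', 'C', 'C'],
    'sales': [10, 20, 30, 50, 5, 15],
})

# What a bar chart shows: one value (here the mean) per category
bar_heights = df.groupby('shop')['sales'].mean()

# What a histogram shows: how many values fall in each numeric bin, ignoring the category
bin_counts = pd.cut(df['sales'], bins=[0, 20, 40, 60]).value_counts(sort=False)

print(bar_heights)
print(bin_counts)
```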


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
categorias = ['A', 'B', 'C']
valores = [1, 5, 3]

sns.barplot(x=categorias, y=valores)

In [None]:
tendas = ['A Coruña', 'Betanzos', 'Oleiros']
vendas = [100000, 10000, 30000]

sns.barplot(x=tendas, y=vendas)

In [None]:
# When to use a bar chart?
# A bar chart is useful for finding or highlighting differences between different groups of data.

In [None]:
# Data on arms sales in 2020
# Source: https://www.data-to-viz.com/story/OneNumOneCat.html
# SIPRI Arms Industry Database: https://sipri.org/databases/armsindustry

df_venda_armas = pd.read_csv('../datasets/venda_armas_paises.csv')
df_venda_armas


In [None]:
sns.barplot(data=df_venda_armas, x='country',y='total_sales')

In [None]:
# Problems with the visualization:
# - We cannot read country identifiers.

In [None]:
# We increase the size of the figure
plt.figure(figsize=(20,7))
sns.barplot(data=df_venda_armas, x='country', y='total_sales')

In [None]:
# The country labels are still hard to read
# We can change the orientation of the chart
plt.figure(figsize=(12,10))
sns.barplot(data=df_venda_armas, y='country',x='total_sales')

In [None]:
# Problem: it is difficult to identify the order of the countries.
# Solution: order the dataframe
df_venda_armas.sort_values(by='total_sales',ascending=False,inplace=True)
plt.figure(figsize=(12,10))
sns.barplot(data=df_venda_armas, y='country',x='total_sales')
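
A variant of the same idea that leaves the original dataframe untouched: `sort_values` returns a new, sorted dataframe when `inplace` is not set. The figures below are invented placeholders, not SIPRI data.

```python
import pandas as pd

# Placeholder figures, for illustration only
df = pd.DataFrame({
    'country': ['X', 'Y', 'Z'],
    'total_sales': [30, 100, 10],
})

# sort_values returns a sorted copy; df itself keeps its original row order
df_sorted = df.sort_values(by='total_sales', ascending=False)

print(df_sorted['country'].tolist())
print(df['country'].tolist())
```

Passing `df_sorted` as `data` to `sns.barplot` should draw the bars in that order, since text categories are plotted in order of appearance.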

In [None]:
# The data will not always be ready to plot; sometimes we have to compute the values first.
df_armas = pd.read_csv('../datasets/venda_armas_2020.csv',sep=';')
df_armas.head()

In [None]:
# With several rows per country, barplot would plot the mean of each country
df_armas.sort_values(by='total_sales',ascending=False,inplace=True)
plt.figure(figsize=(12,10))
sns.barplot(data=df_armas, y='country',x='total_sales')

In [None]:
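# Group the rows by country; as_index=False keeps 'country' as a regular column
df_armas_agrupadas = df_armas.groupby('country', as_index=False)
# Without as_index=False, 'country' would become the index of the result:
#df_armas_agrupadas = df_armas.groupby('country')
# Add up the values of each group to get the totals per country
df_armas_por_pais = df_armas_agrupadas.sum()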
df_armas_por_pais
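
To make the difference between the default mean and the sum concrete, here is a small sketch on invented rows (the values and the countries `X` and `Y` are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical rows: several entries per country (values invented)
df = pd.DataFrame({
    'country': ['X', 'X', 'Y', 'Y', 'Y'],
    'total_sales': [10, 30, 5, 5, 20],
})

# Mean per country: what barplot would draw by default from the raw rows
means = df.groupby('country', as_index=False)['total_sales'].mean()

# Sum per country: the total we actually want to compare
totals = df.groupby('country', as_index=False)['total_sales'].sum()

print(means)
print(totals)
```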

In [None]:
sns.barplot(data=df_armas_por_pais.sort_values(by='total_sales',ascending=False), y='country',x='total_sales')

In [None]:
# Although our example plots sums, to visualize total quantities,
# bar charts are more commonly used to show how a variable behaves across
# certain categories. That is why it makes sense for barplot to use the mean as its default estimator.

# In addition to drawing the means, barplot also represents the variability of the values
# with the black lines (error bars) that it adds to the bars.
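
A rough sketch of what those lines represent. By default, seaborn's error bars show a bootstrapped 95% confidence interval for the mean; the code below reproduces the idea with NumPy on invented values. It approximates the technique rather than reproducing seaborn's own implementation, and `bootstrap_ci` is a hypothetical helper.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    boot_means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower = (100 - level) / 2
    return np.percentile(boot_means, [lower, 100 - lower])

ages = [22, 38, 26, 35, 54, 2, 27, 14, 4, 58]  # invented sample
low, high = bootstrap_ci(ages)
print(np.mean(ages), low, high)
```

The more the values in a group vary, or the fewer values it has, the wider this interval tends to be.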

In [None]:
# For testing purposes, we load the famous dataset with data on the Titanic's passengers.
# Seaborn provides some example datasets through load_dataset.
df_titanic = sns.load_dataset('titanic')

In [None]:
df_titanic.head()

In [None]:
# We can visualize the average ages of the different travelers according to sex.
# Or, in other words, we can answer the following question:
# Are there differences between ages according to sex?
sns.barplot(data = df_titanic, x = 'sex', y= 'age')

In [None]:
# Are the fares different for men and women?
sns.barplot(data = df_titanic, x = 'sex', y= 'fare')

In [None]:
# We can treat boolean variables as numeric. Here 'survived' is stored as 0/1, so its mean is the proportion of survivors.
# Question: were women more likely to survive the catastrophe?
sns.barplot(data = df_titanic, x = 'sex', y= 'survived')
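
Why the mean of a 0/1 column can be read as a probability: a minimal sketch on invented passengers (not the real Titanic data).

```python
import pandas as pd

# Invented passengers: 1 = survived, 0 = did not survive
df = pd.DataFrame({
    'sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group
survival_rate = df.groupby('sex')['survived'].mean()
print(survival_rate)
```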

In [None]:
# Playing with the orientation can help to accentuate or reduce differences and contrasts.
# Which of the two charts do you think conveys a larger difference?
sns.barplot(data = df_titanic, y = 'sex', x= 'survived')

In [None]:
# Did the passengers' ticket class influence their chances of survival?
# Did you expect this result?
sns.barplot(data = df_titanic, x = 'class', y= 'survived')

In [None]:
# Was age an important factor in survival?
sns.barplot(data = df_titanic, x = 'age', y= 'survived')

In [None]:
# Too many categories for a barplot, right?
# Remember that a barplot is useful to represent numerical values versus categorical variables.

In [None]:
# And now, for something completely different...

# LAMBDA functions

# Lambda functions are unnamed functions, typically used only once,
# so it is not worth defining them globally with def.

# A very simple example would be a lambda function that multiplies by 10

# (lambda function definition) (argument)
# (lambda variable: operation on the variable) (value)

(lambda x: x * 10) (7)

# Example: applied to 7, it returns 70
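
For comparison, a short sketch of the same function written with `def` and as a lambda, plus the more usual pattern of passing a lambda straight to another function. The names `times_ten` and `times_ten_lambda` are only for illustration.

```python
# A named function and the equivalent lambda
def times_ten(x):
    return x * 10

times_ten_lambda = lambda x: x * 10

# Lambdas are usually passed directly to another function
values = [3, 1, 2]
scaled = list(map(lambda x: x * 10, values))
by_descending = sorted(values, key=lambda x: -x)

print(times_ten(7), times_ten_lambda(7), scaled, by_descending)
```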


In [None]:
# Lambda functions are applied to dataframe columns with the "apply" method.

In [None]:
datos_1 = {
        'id': ['Ana', 'Berto', 'Carla', 'Dani', 'Elia'],
        'aval1': [9, 8, 7, 6, 5],
        'aval2': [8, 7, 8, 4, 3]}
df_avaliacion = pd.DataFrame(datos_1)
df_avaliacion

In [None]:
# What would happen if a point were added in the second evaluation?
# We can operate on columns
df_avaliacion['aval2_extra'] = df_avaliacion['aval2'] + 1
df_avaliacion

In [None]:
# We can perform basic operations with columns, such as calculating the average of the two evaluations.
df_avaliacion['final'] = (df_avaliacion.aval1 + df_avaliacion.aval2)/2
df_avaliacion

In [None]:
# If we want to perform more complex operations we can use lambda functions.
# For example: pass ('aprobado') or fail ('suspenso')
df_avaliacion['nota'] = df_avaliacion['final'].apply(lambda x: 'suspenso' if x<5 else 'aprobado')
df_avaliacion
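
For a simple two-way condition like this one, a vectorized alternative is `numpy.where`, which works on the whole column at once. The sketch below uses a standalone dataframe whose `final` values match the averages computed above, and checks that both approaches give the same labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'final': [8.5, 7.5, 7.5, 5.0, 4.0]})

# Row by row, with apply and a lambda
with_apply = df['final'].apply(lambda x: 'suspenso' if x < 5 else 'aprobado')

# Whole column at once, with numpy.where(condition, value_if_true, value_if_false)
with_where = pd.Series(np.where(df['final'] < 5, 'suspenso', 'aprobado'), index=df.index)

print(with_apply.tolist())
print(with_where.tolist())
```

`apply` with a lambda remains the more flexible option when the rule does not reduce to a single condition.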

In [None]:
# Returning to the previous example of a barplot with too many categories...

# Sometimes it can be useful to artificially construct the categories from a numerical variable...

# For example: young, middle-aged and senior passengers (low, medium, high)

In [None]:
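# Missing ages (NaN) fail both comparisons and would be labelled 'medium', so we leave them unlabelled
df_titanic['age_level'] = df_titanic['age'].apply(lambda x: None if pd.isna(x) else 'low' if x<25 else 'high' if x>50 else 'medium')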
df_titanic.head()
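
pandas also offers `pd.cut` for turning a numeric column into categories. The sketch below uses invented ages and is not an exact replacement for the lambda: with `right=False` the intervals are closed on the left, so an age of exactly 50 lands in 'high', whereas the lambda puts it in 'medium'. Missing ages stay missing.

```python
import numpy as np
import pandas as pd

ages = pd.Series([10, 25, 49, 50, 80, np.nan])  # invented ages, one missing

# Intervals are closed on the left: [0, 25), [25, 50), [50, inf)
age_level = pd.cut(ages, bins=[0, 25, 50, np.inf],
                   labels=['low', 'medium', 'high'], right=False)
print(age_level.tolist())
```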

In [None]:
# Now we can plot the chart using age_level

sns.barplot(data = df_titanic, x = 'age_level', y= 'survived')

# How would you explain this result? What did you expect?