# 12. Visualisation

## Exercise 12.1

In exercise 11.3, you created a CSV file named 'prices_of_coffee_over_time.csv', containing data about the average price of a pound of coffee on a range of dates. Use this CSV file to create a line chart which visualises the development of these prices over time. 

In [None]:
import pandas as pd

df = pd.read_csv('prices_of_coffee_over_time.csv')

df = df.sort_values(by=['date'])

%matplotlib inline

import matplotlib.pyplot as plt

plt.style.use('ggplot')

fig = plt.figure( figsize = ( 16, 4))
ax = plt.axes()

ax.plot( df['date'] , df['price_per_pound'] , color = '#930d08' , linestyle = 'solid')

ax.set_xlabel('Date')
ax.set_ylabel('Price per pound')

plt.xticks(rotation= 90)

ax.set_title( 'VOC Coffee auctions')

plt.show()
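
The solution above sorts the 'date' column as plain text. As a minimal sketch, assuming that the dates are stored in a format that pandas can parse (such as YYYY-MM-DD), the `parse_dates` parameter of `read_csv()` converts the column to real dates, so that sorting and plotting follow the calendar. The sample values below are hypothetical and only stand in for the actual CSV file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'prices_of_coffee_over_time.csv';
# the date format of the real file may differ.
csv_text = """date,price_per_pound
1750-03-01,1.20
1749-11-15,1.05
1750-01-10,1.10
"""

# parse_dates converts the 'date' column from strings to datetime values
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# Sorting now follows the calendar rather than the alphabet
sample = sample.sort_values(by='date').reset_index(drop=True)

print(sample.dtypes)
print(sample)
```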

## Exercise 12.2

Download the following data set:

https://edu.nl/bcm4x

This file contains data collected for the [2018 Better Life Index](https://stats.oecd.org/index.aspx?DataSetCode=BLI), which was created by the OECD to visualise some of the key factors that contribute to well-being in OECD countries, including education, income, housing and environment.

Using this data set, create a bar chart which can be used to compare either the 'personal_earnings' or the 'life_satisfaction' in OECD countries. 

In [None]:
import pandas as pd

df = pd.read_csv('bli.csv')
df = df.dropna(subset = [ 'personal_earnings' ])


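colours = [ '#DD7373' , '#3B3561' , '#EAD94C' , '#9E1946' , '#c9c4b5'  ]
classColours = dict()

# Assign one colour to each continent. unique() keeps the order of first
# appearance, so the assignment is the same every time the cell is run
unique_categories = list( df['continent'].unique() )
if len( unique_categories ) <= len(colours):
    for u in range( len( unique_categories ) ):
        classColours[ unique_categories[u] ] = colours[u]
else:
    # Stop here: without a complete mapping, the loop below would fail with a KeyError
    raise ValueError("You have more than {} categories. You need to add colours to the list!".format( len(colours) ))

# Create a list containing the colour of each individual bar
colours = []
for category in df['continent']:
    colours.append( classColours[category] )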
    

%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd

y_axis = 'personal_earnings'


fig = plt.figure( figsize=( 14 , 10 ) )
ax = plt.axes()


bar_width = 0.45
opacity = 0.8

ax.bar( df['country'] , df[y_axis] , width = bar_width, alpha = opacity , color = colours)

plt.xticks(rotation= 90)


patchList = []
for key in classColours:
    data_key = mpatches.Patch(color=classColours[key], label=key)
    patchList.append(data_key)
    
plt.legend(handles=patchList , shadow=True, fontsize='large' , frameon = True )
#plt.ylim(0, 10)

ax.set_xlabel('Countries' , fontsize= 12)
ax.set_ylabel( y_axis , fontsize = 12 )
ax.set_title( y_axis , fontsize=20 )


plt.show()

Using seaborn, the bar chart can be created as follows.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

colours = [ '#DD7373' , '#c9c4b5' , '#3B3561' , '#EAD94C' , '#9E1946'  , '#51A3A3'  ]

# Use the 'whitegrid' style and add spacing in between the lines of the legend
sns.set(style='whitegrid', rc = {'legend.labelspacing': 1})


df = pd.read_csv('bli.csv')
df = df.dropna(subset = [ 'personal_earnings' ])

fig = plt.figure( figsize=( 14 , 10 ) )

ax = sns.barplot( x = 'country' , y= 'personal_earnings' , data =  df , hue = 'continent' , dodge=False , palette = colours )

plt.xticks(rotation= 90)

# this next line makes sure that the legend is shown outside of the graph
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
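
Both bar charts draw the countries in the order in which they appear in the data frame. As a minimal sketch, assuming that you would like to rank the countries, you can sort the data frame by the plotted column before you pass it to `ax.bar()` or `sns.barplot()`. The country names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical values; the real figures come from 'bli.csv'
sample = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'personal_earnings': [42000, 61000, None, 35000],
})

# Drop the countries without a value, then place the highest values first
ranked = (
    sample
    .dropna(subset=['personal_earnings'])
    .sort_values('personal_earnings', ascending=False)
)

print(ranked['country'].tolist())
```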

## Exercise 12.3

Use the CSV file that you downloaded for exercise 12.2 to create a scatter plot. The X-axis must visualise the values in the column 'self-reported_health', and the Y-axis must show the 'employment_rate'. The size of the points must represent the 'educational_attainment' and the colour of the points must indicate the 'air_pollution'.

N.B. Names of existing colour palettes can be found at [https://python-graph-gallery.com/101-make-a-color-palette-with-seaborn/](https://python-graph-gallery.com/101-make-a-color-palette-with-seaborn/) or at [https://chrisalbon.com/python/data_visualization/seaborn_color_palettes/](https://chrisalbon.com/python/data_visualization/seaborn_color_palettes/). Examples include: "Blues", "BuGn", "YlOrRd", "GnBu", "OrRd", "Greens", "Reds".

In [None]:
import pandas as pd
df = pd.read_csv('bli.csv')

import matplotlib.pyplot as plt

# N.B. In Matplotlib 3.6 and later, this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

x_axis = 'self-reported_health'
y_axis =  'employment_rate'
point_size =  'educational_attainment'
point_colour =  'air_pollution'


df = df.dropna(subset = [x_axis, y_axis])

#air_pollution,water_quality
fig = plt.figure( figsize = ( 10,10 ))
ax = plt.axes()


scatter = ax.scatter( df[x_axis] , df[y_axis] , alpha=0.8,  s= df[point_size] * 10  , c = df[point_colour] * 10, cmap='Reds' )


for index, row in df.iterrows():
    plt.text( row[x_axis], row[y_axis] , row['country'] , fontsize=12.8)
    

ax.set_xlabel( x_axis  , fontsize = 16 )
ax.set_ylabel( y_axis  , fontsize = 16 )
ax.set_title( 'OECD Better Life Index' , fontsize=24 )


plt.show()
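
In the Matplotlib solution, the point sizes are obtained by multiplying the values by 10. As an alternative sketch, the values can be rescaled linearly to a fixed range of marker sizes, which is comparable to what the `sizes` parameter does in seaborn. The function name `scale_sizes` and the sample values are assumptions made for this illustration.

```python
import pandas as pd

def scale_sizes(values, smallest=100, largest=1000):
    """Linearly rescale a numeric Series to the range [smallest, largest]."""
    low, high = values.min(), values.max()
    if high == low:
        # All values are identical: give every point the same medium size
        return pd.Series((smallest + largest) / 2, index=values.index)
    return smallest + (values - low) / (high - low) * (largest - smallest)

# Hypothetical percentages standing in for 'educational_attainment'
attainment = pd.Series([40.0, 70.0, 100.0])
sizes = scale_sizes(attainment)

print(sizes.tolist())
```

The resulting Series can be passed to the `s` parameter of `ax.scatter()`.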

Using seaborn:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

x_axis = 'self-reported_health'
y_axis =  'employment_rate'
point_size =  'educational_attainment'
point_colour = 'air_pollution' 

df = pd.read_csv('bli.csv')
# The column names need to be defined before they can be used to drop incomplete rows
df = df.dropna(subset = [x_axis, y_axis])

# Use the 'whitegrid' style and add spacing in between the lines of the legend 
sns.set(style='whitegrid', rc = {'legend.labelspacing': 1.6})

fig = plt.figure( figsize = ( 10,10 ))


ax = sns.scatterplot(x = x_axis , y = y_axis  , data=df, hue= point_colour , palette="Greens" , size = point_size , sizes=( 100 , 1000) )

for index, row in df.iterrows():
    plt.text( row[x_axis], row[y_axis] , row['country'] , fontsize=12.8)


ax.set_xlabel( 'Self reported health'  , fontsize = 16 )
ax.set_ylabel( 'Employment rate'  , fontsize = 16 )
ax.set_title( 'OECD Better Life Index' , fontsize=24 )



# this next line makes sure that the legend is shown outside of the graph
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);


## Exercise 12.4

PISA is the OECD's [Programme for International Student Assessment](https://www.oecd.org/pisa/). This programme evaluates educational systems globally by measuring the performance of 15-year-old children in mathematics, science and reading. The latest study is from 2018. 

The CSV file '[pisa.csv](https://edu.nl/p97ma)' contains all the scores measured for mathematics and reading between 2000 and 2018.

Using Pandas, Matplotlib and Seaborn, create visualisations which can help to answer the following questions:

1. How did the various countries that were examined in 2018 perform? Which countries had the highest scores, and which countries had the lowest scores? How did the score of the Netherlands compare to those of other countries? You can limit the analyses to the 'total' scores (i.e. those records in which the column 'subject' has the value 'TOT'). 

2. Were the scores for reading correlated to the scores for mathematics in 2018? Answer this question by creating a scatter plot. 

3. How did the scores for reading develop in the Netherlands between 2000 and 2018? Focus on the scores for boys and for girls separately. 

4. Have the scores remained relatively stable over the years if we look at the total scores? Or has there been some variation? How does the variation of the scores for the Netherlands compare to the scores in France, Germany, Belgium and Luxembourg? Try to answer this question by creating a box plot. 

First, download the CSV file and read its contents using the `read_csv()` function from pandas.

In [None]:
import pandas as pd

df = pd.read_csv('pisa.csv')


Next, create a new data frame containing the total scores measured in the year 2018. We can 'subset' the data frame using square brackets. These brackets should contain a criterion that serves as a filter.  

The `sort_values()` method can be used to place all the rows of the data frame in a certain order.

In [None]:
df_2018 = df[ df['subject'] == 'TOT' ]
df_2018 = df_2018[ df_2018['year'] == 2018 ]

df_2018 = df_2018.sort_values('pisa_read')
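
The two filters above are applied one after the other. As a minimal sketch on hypothetical data, the same selection can also be written as a single step, by combining the two criteria with the `&` operator. Each criterion needs its own pair of parentheses.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in 'pisa.csv'
sample = pd.DataFrame({
    'location_name': ['A', 'A', 'B', 'B', 'C'],
    'subject': ['TOT', 'BOY', 'TOT', 'TOT', 'TOT'],
    'year': [2018, 2018, 2018, 2015, 2018],
    'pisa_read': [500, 480, 470, 465, 520],
})

# A row is kept only if both criteria are True
mask = (sample['subject'] == 'TOT') & (sample['year'] == 2018)
subset = sample[mask].sort_values('pisa_read')

print(subset)
```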

Next, we can plot the values in this new data frame using `Seaborn`. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set_style('darkgrid')
plt.figure( figsize = ( 10,7 ))

# The parameter 'x' specifies the values that will be shown on the X-axis.
# The parameter 'y' specifies the values that will be shown on the Y-axis.
# 'hue' determines the colours of the bars. It can be connected to one of
# the variables in the data frame.


sns.barplot( data = df_2018 , x = 'location_name' , y = 'pisa_read' , hue = 'continent' , dodge = False  , ci = None)

plt.legend( bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks( rotation = 90 )

plt.show() 


To examine the correlation between `pisa_read` and `pisa_math`, we can visualise these two variables using a scatter plot. 

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


plt.figure( figsize = ( 6,6 ))


sns.set(style='whitegrid', rc = {'legend.labelspacing': 0.6})

sns.scatterplot(x = 'pisa_read' , y = 'pisa_math' , 
                data = df_2018 , hue = 'continent' , s = 100 ) 

## The next few lines demonstrate the code that
# can be used to annotate a plot. 
# You can place text on the plot using `plt.text()`
# The code below only labels the dot that represents the Netherlands

country = 'Netherlands'
nl = df_2018[ df_2018['location_name'] == country ] 
# .iloc[0] selects the value in the single matching row; calling int() on
# a whole Series is deprecated in recent versions of pandas
plt.text( int( nl['pisa_read'].iloc[0] ) -20, int( nl['pisa_math'].iloc[0] ) +3 , country , size = 14, alpha=0.9 )

        
# this next line makes sure that the legend is shown outside of the graph
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

plt.savefig('scatter.jpg')
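
The scatter plot gives a visual impression of the correlation. As a complementary sketch, the `corr()` method of a pandas Series calculates a correlation coefficient (Pearson's r by default), which expresses the strength of a linear relationship as a number between -1 and 1. The scores below are hypothetical; with the real data you would apply the same method to the columns of `df_2018`.

```python
import pandas as pd

# Hypothetical scores; the actual values are in 'pisa.csv'
sample = pd.DataFrame({
    'pisa_read': [400, 450, 500, 550],
    'pisa_math': [410, 440, 510, 540],
})

# Pearson's r: values close to 1 indicate a strong positive linear relationship
r = sample['pisa_read'].corr(sample['pisa_math'])

print(round(r, 3))
```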



The code below creates a new data frame, based on the data frame that was created from the original CSV file. The first line selects the rows in which the `subject` column has either 'BOY' or 'GIRL' as a value. The second line selects the scores for the Netherlands.

In [None]:
df_nl = df[ df['subject'].isin( ['BOY','GIRL']  ) ]
df_nl = df_nl[ df_nl['location_name'] == 'Netherlands' ]

The newly created data frame `df_nl` now contains all the values for Dutch boys and girls, measured between 2000 and 2018. These values can be plotted as a line chart, using the `lineplot()` function in `Seaborn`. When you add a parameter named `hue`, pointing to one of the variables in the data frame, a separate line will be drawn for each unique value in this particular column.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

# Set the style before the figure is created, so that it applies to this plot
sns.set(style='whitegrid', rc = {'legend.labelspacing': 1})

plt.figure( figsize = ( 6,4 ))
ax = plt.axes()

# df_nl contains the scores for Dutch boys and girls (see the previous cell).
# Question 3 is about reading, so 'pisa_read' is shown on the Y-axis
sns.lineplot( data = df_nl , x = 'year' , y = 'pisa_read' ,  hue = 'subject' , palette = [ '#078217' , '#f5ce4e' ] ,  linewidth = 3 )

# this next line makes sure that the legend is shown outside of the graph
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

ax.set_xlabel('Years' , size = 16 )
ax.set_ylabel('Scores for reading performance' , size = 16 )
ax.set_title( 'PISA scores in the Netherlands' , size = 28 )

plt.savefig('reading.png')
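
To read off the numbers behind the two lines, it can help to reshape the data so that each value of `subject` receives its own column. The sketch below uses the `pivot()` method of pandas on hypothetical reading scores; it assumes that there is exactly one row for each combination of year and subject.

```python
import pandas as pd

# Hypothetical reading scores for two PISA rounds
sample = pd.DataFrame({
    'year': [2015, 2015, 2018, 2018],
    'subject': ['BOY', 'GIRL', 'BOY', 'GIRL'],
    'pisa_read': [491, 515, 470, 499],
})

# One row per year, one column per subject: each column corresponds to one line
wide = sample.pivot(index='year', columns='subject', values='pisa_read')

# Difference between girls and boys in each year
gap = wide['GIRL'] - wide['BOY']

print(wide)
print(gap)
```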

A box plot can be created using the `boxplot()` function.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

# One colour for each of the five countries
colours = [  '#a88732' ,  '#265c28' , '#a0061a' ,  '#431670' , '#1b6ca8' ]

countries = [ 'Netherlands', 'France' , 'Germany' , 'Belgium', 'Luxembourg']
df_countries = df[ df['location_name'].isin(countries) ]

# Question 4 is about the total scores only
df_countries = df_countries[ df_countries['subject'] == 'TOT' ]

sns.boxplot(data= df_countries , x = 'location_name' , y = 'pisa_math' , palette = colours  );
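
Each box in a box plot spans the range from the first to the third quartile, with a line at the median. As a minimal sketch on hypothetical scores, the same statistics can be calculated with `groupby()` and `quantile()`, which makes it possible to compare the variation between countries numerically. The column named 'iqr' (interquartile range) is introduced here for illustration.

```python
import pandas as pd

# Hypothetical mathematics scores for two countries over five PISA rounds
sample = pd.DataFrame({
    'location_name': ['A'] * 5 + ['B'] * 5,
    'pisa_math': [500, 505, 510, 515, 520, 480, 500, 520, 540, 560],
})

# First quartile, median and third quartile per country
summary = (
    sample.groupby('location_name')['pisa_math']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# The interquartile range corresponds to the height of each box
summary['iqr'] = summary[0.75] - summary[0.25]

print(summary)
```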