# Eurovision
*Problem Statement? Research Question?*


Datasets we have:
1. Contest Data (includes hosting countries)
2. Contestants Data
3. Song Data 
4. Votes Data
5. Betting Offices



In [6]:
# Import the different libraries
import pandas as pd

Setting up SQLAlchemy to import datasets cleaned and structured in AWS.

In [48]:
# Let's load values from the .env file
from dotenv import dotenv_values

config = dotenv_values()

# We also will need SQLAlchemy and its functions
from sqlalchemy import create_engine, types
from sqlalchemy.dialects.postgresql import JSON as postgres_json
from sqlalchemy import text # to be able to pass string

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 

# This so-called "line magic" command, among other things, stores the plots in the notebook document.
%matplotlib inline

# warnings suppression
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# import the statsmodels.api module
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [80]:
# define variables for the login
pg_user = config['POSTGRES_USER']  # align the key label with your .env file !
pg_host = config['POSTGRES_HOST']
pg_port = config['POSTGRES_PORT']
pg_db = config['POSTGRES_DB']
pg_schema = config['POSTGRES_SCHEMA']
pg_pass = config['POSTGRES_PASS']

#SQL access details
url = f'postgresql://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}'
engine = create_engine(url, echo=False)
engine.url # password is hidden
with engine.begin() as conn: 
    result = conn.execute(text(f'SET search_path TO {pg_schema};'))

## 1. Contest Data


Key stats about the Contest Data
1. Shape of df: (14, 7)
2. Year column spans 2009 - 2023. There is no row for 2020 because that year's contest was cancelled.
3. 14 host entries: 13 unique countries, with 1 country (Sweden) hosting more than once.


### Importing Data

In [71]:
contest_df_raw = pd.read_csv('/Users/aylaabdullah/Desktop/bootcamp/Final project local work/Alex_Ayla_Neringa/data/contest_data.csv')

### Basic df stats

In [72]:
contest_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 7 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   year                             14 non-null     int64  
 1   host                             14 non-null     object 
 2   date                             14 non-null     object 
 3   semi_countries                   14 non-null     int64  
 4   final_countries                  14 non-null     int64  
 5   jury_countries_voting_final      9 non-null      float64
 6   televote_countries_voting_final  9 non-null      float64
dtypes: float64(2), int64(3), object(2)
memory usage: 916.0+ bytes


In [73]:
contest_df_raw.shape

(14, 7)

In [74]:
contest_df_raw.describe()

Unnamed: 0,year,semi_countries,final_countries,jury_countries_voting_final,televote_countries_voting_final
count,14.0,14.0,14.0,9.0,9.0
mean,2015.714286,34.642857,25.785714,39.777778,39.888889
std,4.496641,2.205139,0.578934,2.438123,2.472066
min,2009.0,31.0,25.0,36.0,35.0
25%,2012.25,33.0,25.25,38.0,39.0
50%,2015.5,35.0,26.0,40.0,40.0
75%,2018.75,36.0,26.0,42.0,42.0
max,2023.0,38.0,27.0,43.0,43.0


In [75]:
contest_df_raw.describe(include='object')

Unnamed: 0,host,date
count,14,14
unique,13,14
top,Sweden,13/05/2023
freq,2,1


### Columns

In [76]:
contest_df_raw.dtypes

year                                 int64
host                                object
date                                object
semi_countries                       int64
final_countries                      int64
jury_countries_voting_final        float64
televote_countries_voting_final    float64
dtype: object

In [77]:
contest_df_raw

Unnamed: 0,year,host,date,semi_countries,final_countries,jury_countries_voting_final,televote_countries_voting_final
0,2023,United Kingdom,13/05/2023,31,26,37.0,38.0
1,2022,Italy,14/05/2022,35,25,40.0,40.0
2,2021,Netherlands,22/05/2021,33,26,39.0,39.0
3,2019,Israel,18/05/2019,35,26,41.0,41.0
4,2018,Portugal,12/05/2018,37,26,43.0,43.0
5,2017,Ukraine,13/05/2017,36,26,42.0,42.0
6,2016,Sweden,14/05/2016,36,26,42.0,42.0
7,2015,Austria,23/05/2015,33,27,38.0,39.0
8,2014,Denmark,10/05/2014,31,26,36.0,35.0
9,2013,Sweden,18/05/2013,33,26,,


#### Year info

In [78]:
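# Double quotes outside so the nested ['year'] lookups also parse on Python < 3.12
print(f"Start year: {contest_df_raw['year'].min()}")
print(f"End year: {contest_df_raw['year'].max()}")
# (2023-2009) is 14, one less than the 15 calendar years in the range, so the 2020 gap is already left out of this count
print(f"Years missing: {(2023-2009) - contest_df_raw['year'].nunique()}, apart from 2020 because contest was cancelled due to Covid-19")
print(f"Column type: {contest_df_raw['year'].dtype}")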


Start year: 2009
End year: 2023
Years missing: 0, apart from 2020 because contest was cancelled due to Covid-19
Column type: int64
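
The gap can also be listed explicitly instead of being inferred from a count. The following is a minimal sketch, assuming a stand-in `years` Series that mirrors the 14 contest years described above (in the notebook it would be `contest_df_raw['year']`):

```python
import pandas as pd

# Stand-in for contest_df_raw['year']: the 14 contest years from 2009 to 2023
years = pd.Series(
    [2023, 2022, 2021, 2019, 2018, 2017, 2016,
     2015, 2014, 2013, 2012, 2011, 2010, 2009],
    name="year",
)

# Every calendar year between the first and last contest, inclusive
expected_years = set(range(int(years.min()), int(years.max()) + 1))

# Years in that range that have no row in the data
missing_years = sorted(expected_years - set(years.tolist()))
print(f"Years missing: {missing_years}")
```

A set difference names the missing years, so an unexpected gap would show up directly rather than being hidden in a count.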


#### Host info

In [28]:
contest_df_raw['host']

0     United Kingdom
1              Italy
2        Netherlands
3             Israel
4           Portugal
5            Ukraine
6             Sweden
7            Austria
8            Denmark
9             Sweden
10        Azerbaijan
11           Germany
12            Norway
13            Russia
Name: host, dtype: object

In [45]:
print(f"List of countries and unique count: {contest_df_raw['host'].value_counts()}")
# Count the hosts whose value_counts() entry exceeds 1 (count() > 1 only compares the total row count)
print(f"Number of countries appearing >1: {(contest_df_raw['host'].value_counts() > 1).sum()}")
print(f"Number of unique countries: {contest_df_raw['host'].nunique()}")


List of countries and unique count: host
Sweden            2
United Kingdom    1
Italy             1
Netherlands       1
Israel            1
Portugal          1
Ukraine           1
Austria           1
Denmark           1
Azerbaijan        1
Germany           1
Norway            1
Russia            1
Name: count, dtype: int64
Number of countries appearing >1: 1
Number of unique countries: 13


#### Date info:
_The `date` column is stored as `object` (DD/MM/YYYY strings) and still has to be converted to a datetime type._
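
One possible way to fix the column type is `pd.to_datetime` with an explicit day-first format. This is a sketch on a small stand-in frame (`contest_df` is a hypothetical name holding three of the dates shown above), not the notebook's actual cleaning step:

```python
import pandas as pd

# Stand-in for contest_df_raw with a few of the dates shown above
contest_df = pd.DataFrame({
    "year": [2023, 2022, 2021],
    "date": ["13/05/2023", "14/05/2022", "22/05/2021"],
})

# The strings are day-first, so the format is given explicitly
# rather than left to pandas to infer
contest_df["date"] = pd.to_datetime(contest_df["date"], format="%d/%m/%Y")

print(contest_df["date"].dtype)
print(contest_df["date"].dt.month.tolist())
```

Passing `format` makes the parse fail loudly on a malformed value instead of silently swapping day and month.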

#### semi_countries

#### final_countries

#### jury countries voting final

#### televote countries voting final

### Rows

In [13]:
contest_df_raw.index

RangeIndex(start=0, stop=14, step=1)

## 2. Contestants Data

### Importing Contestants_Enhanced from AWS.

1. This dataset holds information about the contestants (names, LGBTQIA+ status etc.), descriptive information about the songs (name, lyrics, YouTube URL) and contest results (placements, points/votes).
    - year range: 1956 - 2023
2. Some country names were missing; we fixed this by adding UK & N. Macedonia, identified via the 'performer' column and a Google search.
3. 1734 rows, 26 columns
    - *year*: self-explanatory
    - *to_country_id*: two-letter code of the performer's country (e.g. `ch`)
    - *to_country*: performer's country
    - *performer*: self-explanatory
    - *song*: self-explanatory
    - *place_contest*: _we don't use this column_
    - *sf_num*: _we don't use this column_
    - *running_final*: _we don't use this column_
    - *running_sf*: _we don't use this column_
    - *place_final*: the country & performer's overall final place in the contest. _We have some null values here that need to be addressed, depending on the year and usage of this table._
    - *points_final*: same logic as place_final, but based on the total number of points received rather than the position secured in the competition. _We have some null values here too._
    - *place_sf*: _we don't use this column_
    - *points_sf*: _we don't use this column_
    - *points_tele_final*: points from the televote in the final
    - *points_jury_final*: total points from the juries in the final
    - *points_tele_sf*: televote points in the semi-finals
    - *points_jury_sf*: jury points in the semi-finals
    - *composers*: self-explanatory
    - *lyricists*: self-explanatory
    - *lyrics*: self-explanatory
    - *youtube_url*: self-explanatory
    - *round*: contest round of the entry, e.g. `final` (check unique values here)
    - *country*: same as to_country; self-explanatory
    - *lgbtqia+*: 1 if the performer is part of the LGBTQIA+ community, 0 if not.
    - *start_year*: start year of the country's participation in the Eurovision contest
    - *last_year*: last year of the country's participation in the Eurovision contest (would be current year for all countries that haven't left the competition)
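
Two of the open points in the list above (unique values of `round`, and nulls in `place_final` / `points_final` depending on the year) could be checked along these lines. This is only a sketch: the frame `df` is a toy stand-in for `df_contestants_enhanced`, and its values, including the `semi-final` label, are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_contestants_enhanced (values are made up for illustration)
df = pd.DataFrame({
    "year": [1956, 1956, 2022, 2022, 2022],
    "round": ["final", "final", "final", "semi-final", "final"],
    "place_final": [2.0, 2.0, 1.0, None, 5.0],
    "points_final": [None, None, 631.0, None, 200.0],
})

# Unique values of the round column with their counts (NaN included if present)
round_counts = df["round"].value_counts(dropna=False)

# Share of missing values per year for the two result columns
null_share_by_year = (
    df[["place_final", "points_final"]]
    .isna()
    .groupby(df["year"])
    .mean()
)

print(round_counts)
print(null_share_by_year)
```

Grouping the null mask by year shows whether the gaps cluster in particular years, which helps decide how to handle them for a given use of the table.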



In [50]:
with engine.begin() as conn: # Done with echo=False
    result = conn.execute(text(f'''
                               SELECT * FROM contestants_enhanced; 
                                '''))
    data = result.all()

### Let's create a dataframe out of that
df_contestants_enhanced = pd.DataFrame(data) 
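
As an alternative to fetching rows and wrapping them in `pd.DataFrame`, pandas can run the query and build the frame in one step with `pd.read_sql_query`. The sketch below uses an in-memory SQLite database as a stand-in for the Postgres schema on AWS, with a cut-down three-column table; `df_demo` is a hypothetical name. With the notebook's SQLAlchemy engine, the same call should accept a connection from `engine.connect()` in place of the SQLite one, though that is untested here.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the Postgres schema on AWS
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contestants_enhanced (year INTEGER, to_country TEXT, performer TEXT)"
)
conn.executemany(
    "INSERT INTO contestants_enhanced VALUES (?, ?, ?)",
    [(1956, "Switzerland", "Lys Assia"), (1956, "Netherlands", "Jetty Paerl")],
)

# read_sql_query runs the query and builds the DataFrame, column names included
df_demo = pd.read_sql_query("SELECT * FROM contestants_enhanced", conn)
conn.close()

print(df_demo)
```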

In [93]:
df_contestants_enhanced.head()

Unnamed: 0,year,to_country_id,to_country,performer,song,place_contest,sf_num,running_final,running_sf,place_final,...,points_jury_sf,composers,lyricists,lyrics,youtube_url,round,country,lgbtqia+,start_year,last_year
0,1956,ch,Switzerland,Lys Assia,refrain,2.0,,2.0,,2.0,...,,Georg Benz Stahl,,"(Refrain d'amour...) Refrain, couleur du ciel...",https://youtube.com/watch?v=IyqIPvOkiRk,final,Switzerland,0.0,1956,2025
1,1956,nl,Netherlands,Jetty Paerl,de vogels van holland,2.0,,1.0,,2.0,...,,Cor Lemaire,Annie M. G. Schmidt,De vogels van Holland zijn zo muzikaal Ze lere...,https://youtube.com/watch?v=u45UQVGRVPA,final,Netherlands,0.0,1956,2025
2,1956,be,Belgium,Fud Leclerc,messieurs les noyés de la seine,2.0,,3.0,,2.0,...,,Jacques Say;Jean Miret,Robert Montal,Messieurs les noyés de la Seine Ouvrez-moi les...,https://youtube.com/watch?v=U9O3sqlyra0,final,Belgium,0.0,1956,2025
3,1956,de,Germany,Walter Andreas Schwarz,im wartesaal zum großen glück,2.0,,4.0,,2.0,...,,Walter Andreas Schwarz,,"Es gibt einen Hafen, da fährt kaum ein Schiff ...",https://youtube.com/watch?v=BDNARIDnmTc,final,Germany,0.0,1956,2025
4,1956,fr,France,Mathé Altéry,le temps perdu,2.0,,5.0,,2.0,...,,André Lodge,Rachèle Thoreau,"Chante, carillon Le chant du temps perdu Chant...",https://youtube.com/watch?v=dm1L0XyikKI,final,France,0.0,1956,2025


In [95]:
df_contestants_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1734 entries, 0 to 1733
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   year               1734 non-null   int64  
 1   to_country_id      1734 non-null   object 
 2   to_country         1734 non-null   object 
 3   performer          1734 non-null   object 
 4   song               1731 non-null   object 
 5   place_contest      1678 non-null   float64
 6   sf_num             640 non-null    float64
 7   running_final      1398 non-null   float64
 8   running_sf         605 non-null    float64
 9   place_final        1397 non-null   float64
 10  points_final       1385 non-null   float64
 11  place_sf           605 non-null    float64
 12  points_sf          605 non-null    float64
 13  points_tele_final  181 non-null    float64
 14  points_jury_final  181 non-null    float64
 15  points_tele_sf     212 non-null    float64
 16  points_jury_sf     212 n

In [96]:
df_contestants_enhanced.describe()

Unnamed: 0,year,place_contest,sf_num,running_final,running_sf,place_final,points_final,place_sf,points_sf,points_tele_final,points_jury_final,points_tele_sf,points_jury_sf,lgbtqia+,start_year,last_year
count,1734.0,1678.0,640.0,1398.0,605.0,1397.0,1385.0,605.0,605.0,181.0,181.0,212.0,212.0,1734.0,1734.0,1734.0
mean,1997.103806,14.968415,1.2625,11.425608,9.859504,11.262706,75.041155,9.852893,95.004959,91.325967,91.005525,67.575472,67.575472,0.054787,1972.464245,2023.436563
std,18.935907,10.470999,0.707549,6.718587,5.66958,6.739982,86.779039,5.661229,73.292585,95.126517,77.557255,50.563263,47.545328,0.227629,18.571974,5.301881
min,1956.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1956.0,1980.0
25%,1982.0,7.0,1.0,6.0,5.0,5.0,16.0,5.0,41.0,21.0,35.0,24.75,26.75,0.0,1957.0,2025.0
50%,2002.0,13.0,1.0,11.0,10.0,11.0,51.0,10.0,74.0,55.0,69.0,53.0,58.0,0.0,1961.0,2025.0
75%,2013.0,21.0,2.0,17.0,14.0,16.0,100.0,14.0,133.0,134.0,129.0,104.0,98.25,0.0,1993.0,2025.0
max,2023.0,43.0,2.0,27.0,28.0,27.0,758.0,28.0,403.0,439.0,382.0,204.0,222.0,1.0,2015.0,2025.0


In [98]:
print(f'A list of columns in this dataset: {df_contestants_enhanced.columns}')
print(f'Shape of the dataset: {df_contestants_enhanced.shape}')

A list of columns in this dataset: Index(['year', 'to_country_id', 'to_country', 'performer', 'song',
       'place_contest', 'sf_num', 'running_final', 'running_sf', 'place_final',
       'points_final', 'place_sf', 'points_sf', 'points_tele_final',
       'points_jury_final', 'points_tele_sf', 'points_jury_sf', 'composers',
       'lyricists', 'lyrics', 'youtube_url', 'round', 'country', 'lgbtqia+',
       'start_year', 'last_year'],
      dtype='object')
Shape of the dataset: (1734, 26)


## 3. Song Data

## 4. Votes Data

## 5. Betting Offices 

### Importing mart_betting from AWS.

1. This shows average betting odds per country for each of the years between 2015 & 2023.
2. _Some country names are missing (341 of 350 rows are non-null in the import below)._ The fix is to add UK & N. Macedonia, identified using the 'performer' column and a Google search.
3. 350 rows, 3 columns (year, country_name, betting_odds)
4. Betting odds summary:
    - min: ~1.136
    - mean: ~78.77
    - max: ~544.563
    - Q1: ~5.465
    - Q2 (median): ~34.361
    - Q3: ~120.606




In [81]:
with engine.begin() as conn: # Done with echo=False
    result = conn.execute(text(f'''
                               SELECT * FROM mart_betting; 
                                '''))
    data = result.all()

### Let's create a dataframe out of that
df_mart_betting = pd.DataFrame(data) 

In [82]:
df_mart_betting.describe()

Unnamed: 0,year,betting_odds
count,350.0,350.0
mean,2018.782857,78.76974
std,2.494405,101.223246
min,2015.0,1.136
25%,2017.0,5.464504
50%,2019.0,34.361389
75%,2021.0,120.606365
max,2023.0,544.5625


In [83]:
df_mart_betting.info()

# we are missing some country_names. These will be filled in.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   year          350 non-null    int64  
 1   country_name  341 non-null    object 
 2   betting_odds  350 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 8.3+ KB
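
To see which rows still need a country name, the null rows can be pulled out and counted per year. A minimal sketch, assuming a toy stand-in `df` for `df_mart_betting` (the odds and the position of the gaps are made up):

```python
import pandas as pd

# Toy stand-in for df_mart_betting (odds and gaps are made up for illustration)
df = pd.DataFrame({
    "year": [2015, 2015, 2019, 2019],
    "country_name": ["Sweden", None, None, "Italy"],
    "betting_odds": [2.2, 150.0, 90.5, 12.0],
})

# Rows whose country_name still has to be filled in
missing_rows = df[df["country_name"].isna()]

# How many of those rows fall in each year
missing_per_year = missing_rows.groupby("year").size()

print(missing_rows)
print(missing_per_year)
```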


In [84]:
df_mart_betting.shape

(350, 3)

### Here's a pretty box plot of Betting_Odds summary stats

In [85]:
# Box plot of the average betting odds per country and year
fig = px.box(df_mart_betting, y="betting_odds")
fig.show()

### Importing prep_betting from AWS.

1. This table shows 
    - betting scores for each performer/song, per country, per year (2015-2023)
    - includes song URLs
2. 9453 rows, 11 columns
    - *betting_bm_id*: internal unique identifier for the bookmaker in our dataset (e.g. 5 = BET365); it stays the same across all countries
    - *betting_sc_id*: _we don't want to get into this; we refuse to believe that this datapoint is useful for us_
    - *betting_name*: name of the betting company
    - *betting_score*: the betting score given to the performer for that particular year (detailed by song & country name too)
    - *year*: self-explanatory
    - *performer*: self-explanatory
    - *song*: self-explanatory
    - *page_url*: relative link to the Eurovision song's page (e.g. `/eurovision/2015/sweden`)
    - *contest_round*: 3 unique values
        - final -> this is the one we will be focusing on.
        - semi_final_1
        - semi_final_2
    - *country_name*: self-explanatory
    - *country_code*: holds country names in the rows shown below and has nulls (9129 of 9453 non-null), so we don't bother with it and use country_name instead.
3. Betting score summary stats:
    - min: 1.0
    - mean: ~84.9
    - max: 1001.0
    - Q1: 2.5
    - Q2: 26.0
    - Q3: 101.0

In [86]:
with engine.begin() as conn: # Done with echo=False
    result = conn.execute(text(f'''
                               SELECT * FROM prep_betting; 
                                '''))
    data = result.all()

### Let's create a dataframe out of that
df_prep_betting = pd.DataFrame(data) 

In [87]:
df_prep_betting.head()

Unnamed: 0,betting_bm_id,betting_sc_id,betting_name,betting_score,year,performer,song,page_url,contest_round,country_name,country_code
0,24,12,MATCHBOOK,2.21,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,Sweden
1,2,-200,BETFAIR*EXCHANGE,2.2,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,Sweden
2,5,220,BET365,3.75,2015,Polina Gagarina,A Million Voices,/eurovision/2015/russia,final,Russia,Russia
3,4,153,UNIBET,4.5,2015,Polina Gagarina,A Million Voices,/eurovision/2015/russia,final,Russia,Russia
4,18,139,YOUWIN,4.0,2015,Polina Gagarina,A Million Voices,/eurovision/2015/russia,final,Russia,Russia


In [88]:
df_prep_betting.describe()

Unnamed: 0,betting_bm_id,betting_sc_id,betting_score,year
count,9453.0,9453.0,9407.0,9453.0
mean,20.927854,83.820903,84.904543,2018.97641
std,14.381842,107.033836,141.725063,2.248565
min,1.0,-1000.0,1.0,2015.0
25%,11.0,69.0,2.5,2017.0
50%,19.0,102.0,26.0,2019.0
75%,30.0,131.0,101.0,2020.0
max,53.0,220.0,1001.0,2023.0


In [91]:
df_prep_betting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9453 entries, 0 to 9452
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   betting_bm_id  9453 non-null   int64  
 1   betting_sc_id  9453 non-null   int64  
 2   betting_name   9453 non-null   object 
 3   betting_score  9407 non-null   float64
 4   year           9453 non-null   int64  
 5   performer      9453 non-null   object 
 6   song           9453 non-null   object 
 7   page_url       9453 non-null   object 
 8   contest_round  9453 non-null   object 
 9   country_name   9453 non-null   object 
 10  country_code   9129 non-null   object 
dtypes: float64(1), int64(3), object(7)
memory usage: 812.5+ KB


In [89]:
df_prep_betting.describe(include='object')

Unnamed: 0,betting_name,performer,song,page_url,contest_round,country_name,country_code
count,9453,9453,9453,9453,9453,9453,9129
unique,33,329,348,354,3,45,43
top,BOYLESPORTS,Go_A,You,/eurovision/2020/netherlands,final,Sweden,Sweden
freq,576,92,72,57,4550,290,290


In [38]:
print(f'The Shape of the dataset: {df_prep_betting.shape}')
print(f'A list of columns in the dataset: {df_prep_betting.columns}')

The Shape of the dataset: (9453, 11)
A list of columns in the dataset: Index(['betting_bm_id', 'betting_sc_id', 'betting_name', 'betting_score',
       'year', 'performer', 'song', 'page_url', 'contest_round',
       'country_name', 'country_code'],
      dtype='object')


### Here's a pretty box plot of Betting_Score summary stats

In [68]:
# Box plot of all individual betting scores
fig = px.box(df_prep_betting, y="betting_score")
fig.show()

In [69]:
# Histogram of betting scores (values on the y-axis) with a marginal box plot
fig = px.histogram(df_prep_betting, y="betting_score", nbins=30, marginal='box', title='Distribution of betting scores')
fig.show()

## Extra fancy shmancy stuff

Here we wanted to show box plots of betting scores for the top countries in the finals

In [54]:
df_prep_betting_on_contestants = pd.merge(right = df_contestants_enhanced, left = df_prep_betting, on=['year','performer'])
df_prep_betting_on_contestants.head()

Unnamed: 0,betting_bm_id,betting_sc_id,betting_name,betting_score,year,performer,song_x,page_url,contest_round,country_name,...,points_jury_sf,composers,lyricists,lyrics,youtube_url,round,country,lgbtqia+,start_year,last_year
0,5,220,BET365,2.1,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,...,,Anton Hård af Segerstad;Joy Deb;Linnea Deb,,Don't tell the gods I left a mess I can't undo...,https://youtube.com/watch?v=5sGOwFVUU0I,final,Sweden,0.0,1958,2025
1,4,153,UNIBET,2.0,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,...,,Anton Hård af Segerstad;Joy Deb;Linnea Deb,,Don't tell the gods I left a mess I can't undo...,https://youtube.com/watch?v=5sGOwFVUU0I,final,Sweden,0.0,1958,2025
2,18,139,YOUWIN,2.38,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,...,,Anton Hård af Segerstad;Joy Deb;Linnea Deb,,Don't tell the gods I left a mess I can't undo...,https://youtube.com/watch?v=5sGOwFVUU0I,final,Sweden,0.0,1958,2025
3,15,131,BOYLESPORTS,2.25,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,...,,Anton Hård af Segerstad;Joy Deb;Linnea Deb,,Don't tell the gods I left a mess I can't undo...,https://youtube.com/watch?v=5sGOwFVUU0I,final,Sweden,0.0,1958,2025
4,21,124,CORAL,2.1,2015,Måns Zelmerlöw,Heroes,/eurovision/2015/sweden,final,Sweden,...,,Anton Hård af Segerstad;Joy Deb;Linnea Deb,,Don't tell the gods I left a mess I can't undo...,https://youtube.com/watch?v=5sGOwFVUU0I,final,Sweden,0.0,1958,2025
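
`pd.merge` defaults to an inner join, so betting rows whose `year`/`performer` pair has no exact match in the contestants table are dropped without notice. One way to check for that is a left join with `indicator=True`. The frames below are toy stand-ins for `df_prep_betting` and `df_contestants_enhanced`, and "Unknown Act" is a made-up performer used to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for df_prep_betting and df_contestants_enhanced (made-up rows)
betting = pd.DataFrame({
    "year": [2015, 2015, 2015],
    "performer": ["Måns Zelmerlöw", "Måns Zelmerlöw", "Unknown Act"],
    "betting_score": [2.21, 2.2, 50.0],
})
contestants = pd.DataFrame({
    "year": [2015],
    "performer": ["Måns Zelmerlöw"],
    "place_final": [1.0],
})

# Left join keeps unmatched betting rows; indicator adds a _merge column;
# validate raises if a (year, performer) pair appears twice in contestants
checked = pd.merge(
    left=betting,
    right=contestants,
    on=["year", "performer"],
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Betting rows with no contestant match (spelling differences, for example)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["year", "performer"]])
```

If `unmatched` is empty on the real tables, the inner join above loses nothing; otherwise the listed performers are the ones to reconcile.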


In [60]:
df_prep_betting_on_contestants_top3 = df_prep_betting_on_contestants[df_prep_betting_on_contestants['place_final'].isin(range(1,4))]

In [61]:
df_prep_betting_on_contestants_top3['place_final'].unique()

array([1., 2., 3.])

In [66]:
fig = px.box(df_prep_betting_on_contestants_top3, x="year", y="betting_score", color = 'country_name', title='Box plots of betting_scores for top 3 countries')
fig.show()

In [92]:
fig = px.bar(df_prep_betting_on_contestants_top3, x='year', y='betting_score', color='betting_name', title='Bar graph of betting scores over the years for top 3 finalists')
fig.show()
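
Because the merged table has many rows per year and bookmaker, the bar chart above stacks every individual row, so bar heights reflect sums rather than typical scores. If a per-bookmaker average is what we want to compare, the data could be aggregated first. A sketch on a toy stand-in (`top3` and its scores are made up):

```python
import pandas as pd

# Toy stand-in for df_prep_betting_on_contestants_top3 (made-up scores)
top3 = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016],
    "betting_name": ["BET365", "BET365", "UNIBET", "BET365"],
    "betting_score": [2.0, 4.0, 3.0, 5.0],
})

# One row per (year, bookmaker): the mean betting score
mean_scores = (
    top3.groupby(["year", "betting_name"], as_index=False)["betting_score"]
    .mean()
)
print(mean_scores)
```

The aggregated frame can then be passed to `px.bar` with the same `x`, `y` and `color` arguments, optionally with `barmode='group'` to place bookmakers side by side.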