![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Soccer Analytics

Welcome to a Jupyter notebook on soccer analytics. This notebook is a free resource and is part of the Callysto project, which brings data science skills to grades 5 to 12 classrooms. 

In this notebook, we answer the question: How do ball possession and scoring relate?


Visualizations will be coded using Python, a computer programming language. Python contains words from English and is used by data scientists. Programming languages are how people communicate with computers.

“Run” the cells to see the graphs.
Click “Cell” and select “Run All.” This will import the data and run all the code to create the data visualizations (scroll back to the top after you’ve run the cells).   

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

In [None]:
# import Python libraries

import pandas as pd
import plotly.express as px

### Making a CSV file
Data source: https://www.uefa.com/uefachampionsleague/standings/
<br>Data for the group phase (6 games per team) of the 2020-2021 season was collected from the Champions League website and typed into the cell below by reading the tables on the site. Notice that the values are separated by commas; pandas expects this format when it reads the file. The `%%writefile` cell magic on the first line of the cell saves the rest of the cell as a file named `possession.csv`.

In [None]:
%%writefile possession.csv
Total goals,Goal difference,Average ball possession (%),Team
18,16,61,Bayern
16,11,57,Barcelona
16,7,44,Monchengladbach
15,5,50,Man. United
14,12,54,Chelsea
14,10,51,Juventus
13,12,59,Man. City
13,7,54,Paris
12,7,56,Dortmund
11,2,58,Real Madrid
11,-1,51,Leipzig
11,4,47,Lazio
10,7,53,Liverpool
10,7,41,Porto
10,-7,48,RB Salzburg
10,2,47,Atalanta
9,1,57,Sevilla
8,-2,51,Club Brugge
7,0,55,Ajax
7,-2,51,Inter Milan
7,-1,50,Atletico Madrid
7,-11,45,Istanbul Basaksehir
6,-5,40,Krasnodar
5,-12,47,Ferencvaros
5,-7,47,Shakhtar Donetsk
5,-5,42,Lokomotiv Moskva
4,-9,47,Zenit
4,-9,46,Midtjylland
4,-9,45,Dynamo Kyiv
3,-8,50,Rennes
2,-8,50,Olympiacos
2,-11,50,Marseille

Pandas is a Python library used to organize data. Here it tells the computer to read the file and display the data in a table, called a dataframe. The dataframe below is sorted from highest to lowest average ball possession (%).

In [None]:
possession_df = pd.read_csv('possession.csv')
possession_df.sort_values('Average ball possession (%)', ascending=False)
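
To see how the commas in the file map onto columns, here is a minimal sketch that reads three rows of the data from a string instead of a file and then sorts them by a different column. The names `sample_csv`, `sample_df` and `sorted_df` are made up for this illustration; `io.StringIO` simply lets pandas treat the text like a file.

```python
import io
import pandas as pd

# Three rows copied from possession.csv; each comma marks a new column
sample_csv = """Total goals,Goal difference,Average ball possession (%),Team
18,16,61,Bayern
10,7,41,Porto
14,12,54,Chelsea
"""

# StringIO lets read_csv treat the text as if it were a file
sample_df = pd.read_csv(io.StringIO(sample_csv))
print(list(sample_df.columns))

# Sort by a different column: most to fewest total goals
sorted_df = sample_df.sort_values("Total goals", ascending=False)
print(sorted_df)
```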

Since we are exploring how possession and scoring relate, let's calculate some measures of spread and central tendency on average ball possession (%) to better understand the shape of the data.

In [None]:
# Compute min, max, range, mean and median
# Min average ball possession
min_df = possession_df['Average ball possession (%)'].min() # change to 'Total goals' or 'Goal difference' for different calculations
# Max average ball possession
max_df = possession_df['Average ball possession (%)'].max()
# Range average ball possession
range_df = (possession_df['Average ball possession (%)'].max()) - (possession_df['Average ball possession (%)'].min())
# Mean of average ball possession
mean_df = possession_df['Average ball possession (%)'].mean()
# Median of average ball possession
median_df = possession_df['Average ball possession (%)'].median()
# Print results
print("The minimum value is", min_df)
print("The maximum value is", max_df)
print("The range is", range_df)
print("The mean is", mean_df)
print("The median is", median_df)

Notice that the median is 50, the mean is about 50.1, and the range is 21. 

You can update or change the code. Follow the directions after the # in the code cell above.
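
As a cross-check on the printed values, pandas can produce most of these statistics in one call with `describe()`. This is a sketch rather than part of the original notebook: it retypes the possession column as a list (`possession_values`, a name chosen here) so that it runs on its own.

```python
import pandas as pd

# Average ball possession (%) for the 32 teams, in the same order as possession.csv
possession_values = [61, 57, 44, 50, 54, 51, 59, 54, 56, 58, 51, 47, 53, 41, 48, 47,
                     57, 51, 55, 51, 50, 45, 40, 47, 47, 42, 47, 46, 45, 50, 50, 50]
possession = pd.Series(possession_values, name="Average ball possession (%)")

# describe() returns count, mean, std, min, quartiles (50% is the median) and max
summary = possession.describe()
print(summary)

# The range is not part of describe(), so compute it from max and min
value_range = summary["max"] - summary["min"]
print("The range is", value_range)
```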

Now, let's visualize the range with a bar graph.

In [None]:
bar_df = px.bar(possession_df, 
                x='Team', 
                y='Average ball possession (%)', # change y to Total goals or Goal difference to visualize different variables
                title='Average ball possession (%) by team') # update title, if needed
bar_df.update_layout(xaxis={'categoryorder':'total descending'})

Notice that the x-axis represents teams, and the y-axis represents average ball possession (%). Bayern has the highest average ball possession at 61%, and Krasnodar has the lowest at 40%. Man. United, Atletico Madrid, Rennes, Olympiacos, and Marseille all have ball possession of 50%, which matches the median and is very close to the mean. These measures of central tendency can help us divide the dataset into teams with more ball possession and teams with less ball possession.
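
One way to act on that idea is to split the teams at the median and compare the two groups. The sketch below retypes two columns from `possession.csv` so it can run on its own. The column name `Possession group` and the variable names are invented for this example, and placing teams that sit exactly at the median in the lower group is an arbitrary choice.

```python
import pandas as pd

# Possession and total goals for the 32 teams, in the same order as possession.csv
teams_df = pd.DataFrame({
    "Average ball possession (%)": [61, 57, 44, 50, 54, 51, 59, 54, 56, 58, 51, 47,
                                    53, 41, 48, 47, 57, 51, 55, 51, 50, 45, 40, 47,
                                    47, 42, 47, 46, 45, 50, 50, 50],
    "Total goals": [18, 16, 16, 15, 14, 14, 13, 13, 12, 11, 11, 11,
                    10, 10, 10, 10, 9, 8, 7, 7, 7, 7, 6, 5,
                    5, 5, 4, 4, 4, 3, 2, 2],
})

median_possession = teams_df["Average ball possession (%)"].median()

# Label each team; teams exactly at the median go in the lower group
above = teams_df["Average ball possession (%)"] > median_possession
teams_df["Possession group"] = above.map({True: "above median", False: "at or below median"})

# Number of teams and mean total goals in each group
group_summary = teams_df.groupby("Possession group")["Total goals"].agg(["count", "mean"])
print(group_summary)
```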

Now that we've explored the centre and spread of average ball possession (%), let's examine how average ball possession (%) relates to total goals. The scatter plot displays average ball possession (%) on the x-axis and total goals on the y-axis. Total goals range from 2 (Marseille and Olympiacos) to 18 (Bayern). Hover over the data points to view more information.

In [None]:
scatter_total_df = px.scatter(possession_df,
                    x="Average ball possession (%)", 
                    y="Total goals", # change y to Goal difference
                    hover_data=["Team"],
                    trendline="ols",
                    title="Relationship between average ball possession (%) and total goals")
scatter_total_df.show()

Notice that the line of best fit indicates a positive trend with total goals increasing with average ball possession. 

Hover over the data points to find out more information. The data points further from the line seem to tell a different story. Bayern has the highest ball possession at 61% and the most total goals at 18. Marseille and Olympiacos, on the other hand, have the fewest total goals at 2 despite ball possession of 50%.
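
To put a number on the trend, the sketch below computes the Pearson correlation coefficient and the slope of a least-squares line with NumPy. It is an independent calculation on values retyped from `possession.csv`, not the code Plotly runs for `trendline="ols"`, and the variable names are chosen for this example.

```python
import numpy as np

# Possession and total goals for the 32 teams, in the same order as possession.csv
possession = np.array([61, 57, 44, 50, 54, 51, 59, 54, 56, 58, 51, 47, 53, 41, 48, 47,
                       57, 51, 55, 51, 50, 45, 40, 47, 47, 42, 47, 46, 45, 50, 50, 50])
total_goals = np.array([18, 16, 16, 15, 14, 14, 13, 13, 12, 11, 11, 11, 10, 10, 10, 10,
                        9, 8, 7, 7, 7, 7, 6, 5, 5, 5, 4, 4, 4, 3, 2, 2])

# Pearson correlation: +1 is a perfect rising line, 0 is no linear trend
r = np.corrcoef(possession, total_goals)[0, 1]

# A degree-1 polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(possession, total_goals, 1)

print("Correlation coefficient:", round(r, 2))
print("Goals per extra percentage point of possession:", round(slope, 2))
```

A positive slope together with a correlation between 0 and 1 agrees with the upward line on the scatter plot; the further the correlation is below 1, the more the points scatter around that line.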

While total goals can help us understand how successful teams are, possession is about keeping the ball both to score and to prevent the other team from scoring. It might be interesting to explore the relationship between average ball possession and goal difference. 

Goal difference is the number of goals a team has scored minus the number of goals scored against it. The scatter plot below visualizes the relationship between average ball possession (%) and goal difference by team. The goal difference on the y-axis includes negative values; a negative value means that a team has had more goals scored against it than it has scored. Hover over the data points to view more information.

In [None]:
scatter_difference_df = px.scatter(possession_df,
                    x="Average ball possession (%)", 
                    y="Goal difference",
                    size="Total goals",
                    color="Team",
                    title="Relationship between average ball possession (%) and goal difference by team")
scatter_difference_df.show()

Notice that Bayern leads in ball possession at 61%, in total goals at 18, and in goal difference at 16. That means only 2 goals were scored against Bayern in the 6 games before the knockout rounds. 

Ferencvaros has the lowest goal difference at -12, with ball possession of 47%. Marseille, tied with Olympiacos for the fewest total goals at 2, is tied with Istanbul Basaksehir for the second-lowest goal difference at -11, with ball possession of 50%. 
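
Because goal difference is goals scored minus goals scored against, the number of goals scored against a team can be recovered by subtraction. Here is a short sketch using three rows from the data; the column name `Goals against` is introduced for this example and is not in the original file.

```python
import pandas as pd

# Three rows copied from possession.csv
examples_df = pd.DataFrame({
    "Team": ["Bayern", "Ferencvaros", "Marseille"],
    "Total goals": [18, 5, 2],
    "Goal difference": [16, -12, -11],
})

# goal difference = goals scored - goals against, so
# goals against = goals scored - goal difference
examples_df["Goals against"] = examples_df["Total goals"] - examples_df["Goal difference"]
print(examples_df)
```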

In [None]:
# This cell prevents the next section from running automatically
%%script false

In [None]:
#❗️Run this cell with Shift+Enter
import interactive as i
i.challenge1()

In [None]:
#❗️Run this cell with Shift+Enter
import interactive as i
i.challenge2()

In [None]:
#❗️Run this cell with Shift+Enter
import interactive as i
i.challenge3()

To reset the last three interactive questions, select Kernel and then Restart & Clear Output from the menu.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)