## UCL Winners Exploratory Data Analysis

Since we are close to knowing this year's winner of the most prestigious European club tournament, the UEFA Champions League, let's recap which teams have won the most editions of this competition throughout its history.

For this analysis we will use data obtained from [Wikipedia](https://en.wikipedia.org/wiki/UEFA_Champions_League), which covers the finals of the European club championship since its inception in 1955. I have already parsed and worked on this [data](https://github.com/fvgm-spec/csv_files/blob/master/uefa_champions_winners.csv) and updated it with current records. It is available in my GitHub repo, along with the [notebook](https://github.com/fvgm-spec/Data_Science_Projects/blob/master/UEFA_winners_Analysis.ipynb) for this analysis.

### Data Acquisition

This dataset, obtained from Wikipedia, contains data for the finals of the European club championship since its inception in 1955. For reference, see this [link](http://en.wikipedia.org/wiki/UEFA_Champions_League).

First we create a new folder for the dataset in the working directory using the _os_ package, but only if the path does not already exist. Then we use the _urllib_ package to download the CSV file from *base_url* into that folder, after checking that the file is not already there.

Once we have the dataset, we'll convert it into a DataFrame with the *read_csv* function.

In [1]:
# Importing the packages we need
import os
import urllib.request  # plain "import urllib" does not load the request submodule

import pandas as pd

# Creating the data folder
if not os.path.exists('./data'):
    os.makedirs('./data')

# Obtaining the dataset from the raw-file URL
# (the github.com/.../blob/... page returns HTML, not the CSV itself)
base_url = 'https://raw.githubusercontent.com/fvgm-spec/csv_files/master/uefa_champions_winners.csv'
if not os.path.exists('./data/uefa.csv'):     # avoid downloading if the file exists
    urllib.request.urlretrieve(base_url, './data/uefa.csv')
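The "download only if missing" logic can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_if_missing` is hypothetical, and a temporary local file stands in for the remote CSV so that the example runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True when a download happened, False when it was skipped.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    urllib.request.urlretrieve(url, str(dest))
    return True


# Offline demonstration: a temporary local file plays the role of the remote CSV.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("Season,Winners\n1955–56,Real Madrid\n", encoding="utf-8")
    target = Path(tmp) / "data" / "uefa.csv"

    first = fetch_if_missing(source.as_uri(), target)   # file is fetched
    second = fetch_if_missing(source.as_uri(), target)  # skipped, file exists
    copied = target.read_text(encoding="utf-8")
```

With a real dataset, the first argument would be the raw CSV URL instead of the local file URI.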

Reading the file gives a DataFrame with one row per Champions League final, from the first edition, won by Real Madrid, to the latest edition in the data, won by Bayern Munich. The output shown below was captured with 64 rows, ending at the 2018–19 season.

In [2]:
df = pd.read_csv('./data/uefa.csv', encoding='utf-8')  # the file downloaded in the previous cell
df

| | Season | Nation | Winners | Score | Runners-up | Runner-UpNation | Venue | Attendance |
|---|---|---|---|---|---|---|---|---|
| 0 | 1955–56 | Spain | Real Madrid | 4–3 | Stade de Reims | France | Parc des Princes, Paris | 38239 |
| 1 | 1956–57 | Spain | Real Madrid | 2–0 | Fiorentina | Italy | Santiago Bernabéu Stadium, Madrid | 124000 |
| 2 | 1957–58 | Spain | Real Madrid | 3–2 | Milan | Italy | Heysel Stadium, Brussels | 67000 |
| 3 | 1958–59 | Spain | Real Madrid | 2–0 | Stade de Reims | France | Neckarstadion, Stuttgart | 72000 |
| 4 | 1959–60 | Spain | Real Madrid | 7–3 | Eintracht Frankfurt | Germany | Hampden Park, Glasgow | 127621 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 2015–16 | Spain | Real Madrid | 1–1*[J] | Atletico Madrid | Spain | Giuseppe Meazza | 71942 |
| 61 | 2016–17 | Spain | Real Madrid | 4–1 | Juventus | Italy | Cardiff Stadium | 65842 |
| 62 | 2017–18 | Spain | Real Madrid | 4–1 | Liverpool | England | Olimpiyskiy NCS Stadium | 61561 |
| 63 | 2018–19 | England | Liverpool | 2–0 | Tottenham | England | Wanda Metropolitano | 63272 |


### Grouping data with Pandas

The group by clause is an operation on DataFrames. A Series is a 1D object, so performing a group by operation on it is not very useful. However, it can be used to obtain distinct rows of the Series. The result of a group by operation is not a DataFrame but a GroupBy object that behaves much like a dict mapping each group key to its rows.

The output we saw above shows the season, the nations to which the winning and runner-up clubs belong, the score, the venue, and the attendance figures. Suppose we wanted to rank the nations by the number of European club championships they have won. We can do this by using groupby. First, we apply groupby to the DataFrame and check the type of the result:

In [3]:
nationsGrp = df.groupby('Nation')  # a single key as a plain string keeps the group names scalar
nationsGrp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024F4FD52CD0>

We notice that **nationsGrp** is a _pandas.core.groupby.generic.DataFrameGroupBy_ object. The column on which we use groupby is referred to as the key, and each key maps to the rows that share that value. We can see what the groups look like by using the groups attribute on the resulting _DataFrameGroupBy_ object:

In [4]:
nationsGrp.groups

{'England': [12, 21, 22, 23, 24, 25, 26, 28, 43, 49, 52, 56, 63], 'France': [37], 'Germany': [18, 19, 20, 27, 41, 45, 57, 64], 'Italy': [7, 8, 9, 13, 29, 33, 34, 38, 40, 47, 51, 54], 'Netherlands': [14, 15, 16, 17, 32, 39], 'Portugal': [5, 6, 31, 48], 'Romania': [30], 'Scotland': [11], 'Spain': [0, 1, 2, 3, 4, 10, 36, 42, 44, 46, 50, 53, 55, 58, 59, 60, 61, 62], 'Yugoslavia': [35]}
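To make the structure of this mapping easier to see, here is a minimal sketch on a made-up five-row sample (hypothetical rows, not the real dataset). It shows the `groups` attribute next to `get_group`, which returns the rows behind one key as an ordinary DataFrame.

```python
import pandas as pd

# Hypothetical five-row sample, not the real dataset.
sample = pd.DataFrame({
    "Nation": ["Spain", "Italy", "Spain", "England", "Italy"],
    "Winners": ["Real Madrid", "Milan", "Barcelona", "Liverpool", "Milan"],
})

grouped = sample.groupby("Nation")

# .groups maps each key to the index labels of its rows.
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_labels)

# get_group returns the rows of one group as an ordinary DataFrame.
spain_rows = grouped.get_group("Spain")
print(spain_rows)
```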

As mentioned before, this is essentially a dictionary that shows the unique groups and the axis labels corresponding to each group, in this case the row numbers. For example, we can retrieve the full record at index 62 of the DataFrame, which corresponds to the 2017–18 final, where Real Madrid beat Liverpool to win their 13th "Orejona".

The two cells below first retrieve that row with *iloc* and then obtain the number of groups with the *len()* function:

In [5]:
Spain = df.iloc[62]
Spain

Season                             2017–18
Nation                               Spain
Winners                        Real Madrid
Score                                  4–1
Runners-up                       Liverpool
Runner-UpNation                    England
Venue              Olimpiyskiy NCS Stadium
Attendance                           61561
Name: 62, dtype: object

In [6]:
len(nationsGrp.groups)

10
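Besides `len()` on the groups dictionary, pandas offers a couple of other ways to count groups. The sketch below uses a made-up sample rather than the real data and compares three of them.

```python
import pandas as pd

# Hypothetical sample, not the real dataset.
sample = pd.DataFrame({"Nation": ["Spain", "Italy", "Spain", "England", "Italy"]})
grouped = sample.groupby("Nation")

n_from_len = len(grouped.groups)         # the approach used in the text
n_from_attr = grouped.ngroups            # attribute on the GroupBy object
n_from_col = sample["Nation"].nunique()  # distinct values counted on the column itself

print(n_from_len, n_from_attr, n_from_col)
```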

The grouped data is held in the `DataFrameGroupBy` object named _nationsGrp_. To display it as a table, we first convert the group sizes to a DataFrame with a new measure, _Champion_, and sort it in descending order.

The resulting table shows that the nation with the most Champions League wins is Spain, mostly due to the 13 and 5 trophies won by Real Madrid and Barcelona, respectively.

In [7]:
nationWins = nationsGrp.size().to_frame('Champion')  # rows per nation as a one-column DataFrame
NationsWinners = nationWins.sort_values(by='Champion', ascending=False)
NationsWinners

| Nation | Champion |
|---|---|
| Spain | 18 |
| England | 13 |
| Italy | 12 |
| Germany | 8 |
| Netherlands | 6 |
| Portugal | 4 |
| France | 1 |
| Romania | 1 |
| Scotland | 1 |
| Yugoslavia | 1 |


The _size()_ function returns a Series with the group names as the index and the size of each group. The _size()_ function is also an aggregation function.
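It helps to see how _size()_ differs from two close relatives. The following sketch uses made-up rows (the attendance values, including the missing one, are illustrative only): `size()` counts rows per group, `count()` counts non-null values, and `value_counts()` on the column gives row counts already sorted in descending order.

```python
import pandas as pd

# Hypothetical sample with one missing attendance figure.
sample = pd.DataFrame({
    "Nation": ["Spain", "Italy", "Spain", "England", "Italy", "Spain"],
    "Attendance": [38239, 67000, None, 63272, 71942, 61561],
})

by_size = sample.groupby("Nation").size()                  # rows per group
by_count = sample.groupby("Nation")["Attendance"].count()  # non-null values per group
by_value_counts = sample["Nation"].value_counts()          # rows per value, largest first

print(by_size)
print(by_count)
print(by_value_counts)
```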

To break down the wins further by country and club, we apply a **multicolumn groupby**, followed by size() and sort_values():

In [8]:
winners = df.groupby(['Nation','Winners']).size().to_frame('Champion')
winnersUEFA = winners.sort_values(by='Champion', ascending=False)
winnersUEFA

| Nation | Winners | Champion |
|---|---|---|
| Spain | Real Madrid | 13 |
| Italy | Milan | 7 |
| England | Liverpool | 6 |
| Germany | Bayern Munich | 6 |
| Spain | Barcelona | 5 |
| Netherlands | Ajax | 4 |
| England | Manchester United | 3 |
| Italy | Internazionale | 3 |
| Portugal | Benfica | 2 |
| England | Nottingham Forest | 2 |


A **multicolumn groupby** specifies more than one column to be used as the key by specifying the key columns as a list. Thus, we can see that the most successful club in this competition has been Real Madrid of Spain.
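The result of a multicolumn groupby is indexed by a two-level MultiIndex. As a minimal sketch on made-up rows (not the real dataset), the following shows how one (Nation, Winners) pair can be looked up with a tuple and how the index can be flattened into ordinary columns for a plain ranking table.

```python
import pandas as pd

# Hypothetical sample, not the real dataset.
sample = pd.DataFrame({
    "Nation": ["Spain", "Italy", "Spain", "England", "Italy", "Spain"],
    "Winners": ["Real Madrid", "Milan", "Barcelona", "Liverpool", "Milan", "Real Madrid"],
})

# Series indexed by a two-level MultiIndex (Nation, Winners).
titles = sample.groupby(["Nation", "Winners"]).size()

# Look up one (Nation, Winners) pair with a tuple.
madrid_titles = titles.loc[("Spain", "Real Madrid")]

# Flatten the MultiIndex into ordinary columns and rank by title count.
flat = (
    titles.reset_index(name="Champion")
          .sort_values("Champion", ascending=False, kind="stable")
          .reset_index(drop=True)
)
print(flat)
```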

In this tutorial we have quickly explored one data analysis technique that is well known among data analysts: _groupby_. You can find many more Pandas techniques in the official [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

Now we will examine a richer dataset that will enable us to illustrate many more features of groupby. This dataset is also soccer related and provides statistics for the top four European soccer leagues in the 2012-2013 season:
* English Premier League or EPL
* Spanish Primera Division or La Liga
* Italian First Division or Serie A
* German Premier League or Bundesliga

The source of this information is at http://soccerstats.com.
Let us now read the goal stats data into a DataFrame as usual. In this case, we create a row index on the DataFrame using the _Month_ column.

In [9]:
goalStatsDF = pd.read_csv('C:/Data/csv/goals_stats_euro_leagues.csv')
goalStatsDF = goalStatsDF.set_index('Month')  # use the Month column as the row index

First, let's take a snapshot of the head and tail ends of our dataset:

In [None]:
goalStatsDF.head(3)

In [None]:
goalStatsDF.tail(3)

There are two measures in this DataFrame, _MatchesPlayed_ and _GoalsScored_, and the data is ordered first by Stat and then by Month. Note that the last row in the tail() output has NaN values for all the columns except La Liga, but we'll discuss this in more detail later. We can use groupby to display the stats grouped by year instead. Here is how this is done:

In [None]:
goalStatsGroupedByYear = goalStatsDF.groupby(lambda Month: Month.split('/')[2])
goalStatsGroupedByYear

We can then iterate over the resulting _DataFrameGroupBy_ object and display the groups. The following command prints the two sets of statistics grouped by year. Note the use of the lambda function to obtain the year from each index label, which is the first day of a month.

In [None]:
for name, group in goalStatsGroupedByYear:
    print(name)
    print(group)
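Because the goal stats file is not reproduced here, the year grouping can be illustrated with a minimal sketch on made-up rows. The values and the day/month/year label format are assumptions, chosen only to be consistent with `Month.split('/')[2]` returning the year. The sketch also shows an alternative that parses the labels as dates.

```python
import pandas as pd

# Hypothetical rows; the real file's values and exact date format are assumed.
goal_stats = pd.DataFrame(
    {
        "Stat": ["GoalsScored", "GoalsScored", "GoalsScored"],
        "EPL": [57, 111, 117],
    },
    index=pd.Index(["01/08/2012", "01/12/2012", "01/01/2013"], name="Month"),
)

# A function passed to groupby is called once per index label;
# its return value becomes the group key.
by_year = goal_stats.groupby(lambda month: month.split("/")[2])
epl_by_year = by_year["EPL"].sum()

# Alternative: parse the labels as dates and group on the year number.
years = pd.to_datetime(goal_stats.index, format="%d/%m/%Y").year
epl_by_year_parsed = goal_stats.groupby(years)["EPL"].sum()

print(epl_by_year)
print(epl_by_year_parsed)
```

The lambda version yields string keys such as `'2012'`, while the parsed version yields integer years.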

If we wished to group by individual month instead, we would need to apply groupby with a level argument, as follows:

In [None]:
goalStatsGroupedByMonth = goalStatsDF.groupby(level=0)

In [None]:
for name, group in goalStatsGroupedByMonth:
    print(name)
    print(group)
    print("\n")
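As a minimal sketch of grouping on the index, again with made-up rows and an assumed date format, the level can be given by position or by name. Recent pandas versions also accept the index level name as an ordinary groupby key.

```python
import pandas as pd

# Hypothetical rows: each month appears once per Stat.
goal_stats = pd.DataFrame(
    {
        "Stat": ["MatchesPlayed", "GoalsScored", "MatchesPlayed", "GoalsScored"],
        "EPL": [20, 57, 38, 111],
    },
    index=pd.Index(
        ["01/08/2012", "01/08/2012", "01/09/2012", "01/09/2012"], name="Month"
    ),
)

by_position = goal_stats.groupby(level=0)    # first index level, by position
by_name = goal_stats.groupby(level="Month")  # same level, addressed by name
by_key = goal_stats.groupby("Month")         # index level name used as the key

rows_per_month = by_position.size()
same_groups = (
    by_position.groups.keys() == by_name.groups.keys() == by_key.groups.keys()
)
print(rows_per_month)
```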

Note that since the preceding commands group on an index, we specify the level argument rather than a column name. Recent pandas versions also accept an index level name as the groupby key. When we group by multiple keys, the resulting group name is a tuple, as shown in the upcoming commands. First, we reset the index to obtain the original DataFrame and define a MultiIndex in order to be able to group by multiple keys.

In [None]:
goalStatsDF = goalStatsDF.reset_index()
goalStatsDF = goalStatsDF.set_index(['Month', 'Stat'])

In [None]:
monthStatGroup = goalStatsDF.groupby(level=['Month', 'Stat'])

In [None]:
for name, group in monthStatGroup:
    print(name)
    print(group)
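To illustrate the tuple group names without the original file, here is a minimal sketch on made-up rows (values and date format are assumptions). It builds the same kind of MultiIndex, collects the group names, and fetches one group with `get_group`, which takes the same tuple.

```python
import pandas as pd

# Hypothetical rows standing in for the goal stats file.
goal_stats = pd.DataFrame(
    {
        "Month": ["01/08/2012", "01/08/2012", "01/09/2012", "01/09/2012"],
        "Stat": ["MatchesPlayed", "GoalsScored", "MatchesPlayed", "GoalsScored"],
        "EPL": [20, 57, 38, 111],
    }
).set_index(["Month", "Stat"])

month_stat_group = goal_stats.groupby(level=["Month", "Stat"])

# With more than one key, each group name is a (Month, Stat) tuple.
group_names = [name for name, _ in month_stat_group]
print(group_names)

# get_group takes the same tuple to fetch a single group.
aug_goals = month_stat_group.get_group(("01/08/2012", "GoalsScored"))
print(aug_goals)
```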