In [1]:
# run this cell
import pandas as pd
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Final Project: Intro to Soccer

### Background Knowledge <a id='section 0'></a>


Few things frustrate soccer fans and players as much as a [red card](https://en.wikipedia.org/wiki/Penalty_card#Red_card). In soccer, a player who receives a red card from the referee is expelled from the game. Consequently, that player's team must play with one fewer player for the remainder of the game.

Due to the inherently subjective nature of referees' judgments, questions involving the fairness of red card decisions crop up frequently, especially when soccer players with darker complexions are red-carded.

For the remainder of this project, we will explore a dataset on red cards and skin color and attempt to understand how different approaches to analysis can lead to different conclusions about the general question: "Are referees more likely to give red cards to darker-skinned players?"


 <img src="images/redcard.jpg" width="700"/>

# The Data Science Life Cycle <a id='section 1'></a>

## Formulating a question or problem <a id='section 1'></a>
It is important to ask questions that will be informative and that will avoid misleading results. 

<div class="alert alert-info">
<b>Question:</b> Recall the questions about red cards and skin color that you developed with your group. Write down that question below, and try to add onto it with the context from the articles. Think about what data you would need to answer your question.
   </div>

**Your questions:** *here*

**Data you would need:** *here*


**Article:** *link*

## Acquiring and cleaning data <a id='subsection 1b'></a>
 
In this notebook, you'll be working with a dataset of many European soccer players. It includes variables such as club, position, games, and skin complexion.

An important point about this dataset is that it comes from an [observational study](https://en.wikipedia.org/wiki/Observational_study), rather than a [randomized controlled experiment](https://en.wikipedia.org/wiki/Randomized_controlled_trial). In an observational study, the independent variables of each entity (such as race, height, or zip code) are observed as they occur, rather than assigned by the researcher as in a randomized controlled experiment. Data scientists often prefer the control that randomized experiments offer, but running one is frequently too costly or raises ethical questions (e.g., assigning trial drugs and placebo treatments to cancer patients at random). Our dataset was generated organically, in the real world rather than in a laboratory, which makes it statistically more challenging to establish causation among variables.
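
To see why observational data makes causal claims harder, here is a minimal simulation sketch. Everything in it is synthetic and hypothetical: the hidden factor, the group labels, and all probabilities are made-up numbers chosen for illustration, not estimates from the soccer dataset. The outcome depends only on the hidden factor, yet the observational comparison still shows a gap between groups, while random assignment does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical hidden factor (think: a playing style that draws more fouls).
hidden = rng.random(n) < 0.5

# Observational setting: group membership is correlated with the hidden factor.
group_observed = rng.random(n) < np.where(hidden, 0.8, 0.2)

# Randomized setting: group membership is a fair coin flip.
group_randomized = rng.random(n) < 0.5

# The outcome depends ONLY on the hidden factor, never on the group.
outcome = rng.random(n) < np.where(hidden, 0.10, 0.02)


def rate_gap(group, outcome):
    """Difference in outcome rate between group members and non-members."""
    return outcome[group].mean() - outcome[~group].mean()


observed_gap = rate_gap(group_observed, outcome)
randomized_gap = rate_gap(group_randomized, outcome)

print("gap in observational data:", round(observed_gap, 4))
print("gap under random assignment:", round(randomized_gap, 4))
```

The observational gap appears even though group membership has no effect on the outcome in this simulation. This is why an association found in observational data should not be read as causation without further analysis.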


Please read this summary of the [dataset's description](https://osf.io/9yh4x/) to familiarize yourself with the context of the data:

>*...we obtained data and profile photos from all soccer players (N = 2053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3147) that these players played under in their professional career. We created a dataset of player dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.*

>*...implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks.*
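
The description above means that each row is one player-referee pair (a "dyad"), not one player or one game. The sketch below illustrates that structure on a tiny made-up table. The player names, referee numbers, and counts are hypothetical; only the column names `player`, `refNum`, `games`, and `redCards` are borrowed from the real dataset.

```python
import pandas as pd

# Each row is one (player, referee) dyad with totals over all their shared games.
dyads = pd.DataFrame({
    "player": ["player_a", "player_a", "player_b", "player_b", "player_b"],
    "refNum": [1, 2, 1, 3, 4],
    "games": [3, 1, 2, 5, 1],
    "redCards": [0, 1, 0, 0, 0],
})

# The same player appears in several rows, once per referee encountered.
rows_per_player = dyads["player"].value_counts()

# Summing over a player's dyads gives player-level totals.
per_player = dyads.groupby("player")[["games", "redCards"]].sum()
per_player["redCardsPerGame"] = per_player["redCards"] / per_player["games"]

print(rows_per_player)
print(per_player)
```

Keeping the dyad structure in mind matters later: counting rows is not the same as counting players, and the `games` column has to be taken into account when comparing red card totals.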

In [2]:
# run this cell to load the data
data = pd.read_csv("data/CrowdstormingDataJuly1st.csv").dropna()
data = Table.from_df(data)

Here are some of the important fields in our data set that we will focus on:

|Variable Name   | Description |
|--------------|------------|
|`player` | player's name |
|`club` | player's soccer club (team) |
|`leagueCountry`| country of player's club (England, Germany, France, or Spain) |
|`height` | player height (in cm) |
|`games`| number of games in the player-referee dyad |
|`position` | detailed player position |
|`goals`| goals scored by a player in the player-referee dyad |
|`yellowCards`| number of yellow cards player received from referee |
|`yellowReds`| number of yellow-red cards player received from referee |
|`redCards`| number of red cards player received from referee |
|`rater1`| skin rating of photo by rater 1 (5-point scale ranging from very light skin to very dark skin) |
|`rater2`| skin rating of photo by rater 2 (5-point scale ranging from very light skin to very dark skin) |
|`meanIAT`|  mean implicit bias score (using the race IAT) for referee country, higher values correspond to faster white good, black bad associations |
|`meanExp`| mean explicit bias score (using a racial thermometer task) for referee country, higher values correspond to greater feelings of warmth toward whites versus blacks |

As you can see in the table above, two of the variables we will be exploring are the skin tone ratings given by two raters, Lisa and Shareef, on a 5-point scale (1-5 in the images below; the `rater1` and `rater2` columns store the ratings as values from 0 to 1). For context, we have added a series of images that were given to the raters so that you can better understand their perspective on skin tones. Keep in mind that these subjective ratings might affect our hypothesis and drive our conclusions.

Note: In the following images, the only two where the ratings of the two raters coincide are image #3 on the top and image #6 on the bottom.

<img src="images/L1S1.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/L1S2.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/L2S2.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/L3S4.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/L4S5.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/L5S5.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<p style="clear: both;">

In [3]:
# run this cell to show the first ten rows of the data
data.show(10)

playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,photoID,rater1,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177,72,Attacking Midfielder,1,0,0,1,0,0,0,0,95212.jpg,0.25,0.5,1,1,GRC,0.326391,712,0.000564112,0.396,750,0.00269649
john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179,82,Right Winger,1,0,0,1,0,1,0,0,1663.jpg,0.75,0.75,2,2,ZMB,0.203375,40,0.0108749,-0.204082,49,0.0615044
aaron-hughes,Aaron Hughes,Fulham FC,England,08.11.1979,182,71,Center Back,1,0,0,1,0,0,0,0,3868.jpg,0.25,0.0,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
aleksandar-kolarov,Aleksandar Kolarov,Manchester City,England,10.11.1985,187,80,Left Fullback,1,1,0,0,0,0,0,0,47704.jpg,0.0,0.25,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
alexander-tettey,Alexander Tettey,Norwich City,England,04.04.1986,180,68,Defensive Midfielder,1,0,0,1,0,0,0,0,22356.jpg,1.0,1.0,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
anders-lindegaard,Anders Lindegaard,Manchester United,England,13.04.1984,193,80,Goalkeeper,1,0,1,0,0,0,0,0,16528.jpg,0.25,0.25,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
andreas-beck,Andreas Beck,1899 Hoffenheim,Germany,13.03.1987,180,70,Right Fullback,1,1,0,0,0,0,0,0,36499.jpg,0.0,0.0,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
antonio-rukavina,Antonio Rukavina,Real Valladolid,Spain,26.01.1984,177,74,Right Fullback,2,2,0,0,0,1,0,0,59786.jpg,0.0,0.0,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
ashkan-dejagah,Ashkan Dejagah,Fulham FC,England,05.07.1986,181,74,Left Winger,1,1,0,0,0,0,0,0,23229.jpg,0.5,0.5,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
benedikt-hoewedes,Benedikt Höwedes,FC Schalke 04,Germany,29.02.1988,187,80,Center Back,1,1,0,0,0,0,0,0,59387.jpg,0.0,0.0,4,4,LUX,0.325185,127,0.00329681,0.538462,130,0.0137522
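
Since the two raters do not always agree, it can help to quantify how often they do. The sketch below uses only the `rater1` and `rater2` values from the ten rows shown above, typed in by hand so the example is self-contained. It uses plain pandas rather than the `datascience` Table, and the helper column names `agree` and `meanRating` are introduced here for illustration.

```python
import pandas as pd

# rater1 / rater2 values copied from the ten rows displayed above.
ratings = pd.DataFrame({
    "rater1": [0.25, 0.75, 0.25, 0.0, 1.0, 0.25, 0.0, 0.0, 0.5, 0.0],
    "rater2": [0.5, 0.75, 0.0, 0.25, 1.0, 0.25, 0.0, 0.0, 0.5, 0.0],
})

# Flag the rows where both raters gave the same rating.
ratings["agree"] = ratings["rater1"] == ratings["rater2"]

# One common way to combine two raters is to average them.
ratings["meanRating"] = ratings[["rater1", "rater2"]].mean(axis=1)

agreement_rate = ratings["agree"].mean()
largest_gap = (ratings["rater1"] - ratings["rater2"]).abs().max()

print("share of rows where raters agree:", agreement_rate)
print("largest disagreement:", largest_gap)
```

Ten rows are far too few to draw conclusions from. Averaging the raters is only one possible choice, and it is worth thinking about how that choice could influence the analysis.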


Let's remove the columns we are not going to be working with.

In [None]:
cols_to_drop = make_array("birthday", "victories", "ties", "defeats", "goals",
                "photoID", "Alpha_3", "nIAT", "nExp")

data = data.drop(cols_to_drop)

Let's display the table again to make sure we got rid of all of our undesired columns.

In [None]:
data.show(5)
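
Besides checking the output by eye, you can check the column labels programmatically. The sketch below shows the idea on a tiny made-up pandas DataFrame rather than on the `datascience` Table used in this notebook. The table `toy` and its values are hypothetical.

```python
import pandas as pd

# A one-row stand-in table with a few of the dataset's column names.
toy = pd.DataFrame({
    "player": ["player_a"],
    "birthday": ["01.01.1990"],
    "goals": [0],
    "redCards": [0],
})

cols_to_drop = ["birthday", "goals"]

# drop returns a new DataFrame; the original is left unchanged.
trimmed = toy.drop(columns=cols_to_drop)

# Any dropped column that still shows up here would indicate a problem.
leftover = set(cols_to_drop) & set(trimmed.columns)

print("remaining columns:", list(trimmed.columns))
print("dropped columns still present:", leftover)
```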

<div class="alert alert-info">
<b>Question:</b> It's important to evaluate our data source. How do you feel about the way ratings on skin tone are collected? What about how implicit/explicit bias is calculated?
   </div>

*Insert answer here*

<div class="alert alert-info">
<b>Question:</b> We want to learn more about the dataset. First, how many total rows are in this table? What does each row represent?
    
   </div>

In [None]:
total_rows = ...
total_rows

*Insert answer here*

<div class="alert alert-info">
<b>Question:</b> If we're trying to examine the relationship between red cards given and skin color, which variables should we consider? Classify the ones you choose as either independent or dependent variables and explain your choices.
    
   </div>

*Insert answer here*

**Source:** Data 88 (Sports Analytics)

**Adapted by:** Alleanna Clark, Ashley Quiterio, Karla Palos Castellanos, Pratibha Sriram, William Furtado, and Andrew Chen