# COGS 108 - Data Checkpoint

# Names


* Sean Hwang
* Minhan Lee
* Giho Kim
* Teahyung Kim

<a id='research_question'></a>
# Research Question

Is there a relationship between a soccer player's height and (1) their entry into the English Premier League and (2) their performance?

We will analyze height trends for all positions in the English Premier League from the 1950s onward.

We will also analyze the heights of current professional goalkeepers against their performance ratings from the FIFA series dataset.

# Setup

In [None]:
""" Builtins """
import json
import sqlite3
import zipfile

""" Third Party """
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels

# Data Cleaning

This dataset covers soccer players who played in the English Premier League from 1880 to 2021. It includes each player's year of birth, height, and position. Our analysis focuses on players from 1950 to 2021, using height and position to determine the overall trend in player height and how it differs by position. We will then compare these results with datasets of players from other regions or leagues.

In [None]:
""" 1. Download data (Worldfootball Player Height Data) """
# The download logic lives in crawler.py; print its source here for reference.
with open("./crawler.py", "r") as f:
  crawler_source = f.read()
print(crawler_source)

Manually download the zip of the FIFA 21 complete player dataset from https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset (the loading code expects it as `kaggle.zip`).

This dataset covers more than 8,000 players extracted from the latest editions of the FIFA series, 2015 to 2021. It includes each player's weight, height, nationality, age, and position. We will merge it with the Premier League dataset from 1950 onward to determine the overall trend in the height of English Premier League players from 1950 to 2021. We will also use it to address the part of our research question about player attributes: whether a certain height range is associated with a player's overall rating. Finally, we will visualize the height trend by position to see how much height matters in each position.
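One way the height-range question could be explored is to bin heights and compare the mean rating per bin. The sketch below uses a small made-up table rather than the real data. The column names `height_cm` and `overall` follow the Kaggle CSVs, while the `toy` frame and the 10 cm band edges are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the FIFA table (made-up rows, not real results).
toy = pd.DataFrame({
    "height_cm": [169, 175, 180, 185, 188, 193, 195],
    "overall":   [93, 78, 90, 92, 80, 90, 90],
})

# Hypothetical 10 cm bands; intervals are right-closed, e.g. (170, 180].
bands = pd.cut(toy["height_cm"], bins=[160, 170, 180, 190, 200])

# Mean rating and number of players in each height band.
by_band = toy.groupby(bands, observed=True)["overall"].agg(["mean", "count"])
print(by_band)

# A single summary number: Pearson correlation between height and rating.
height_rating_corr = toy["height_cm"].corr(toy["overall"])
print(round(height_rating_corr, 3))
```

Binning makes a non-linear pattern visible (for example, a "sweet spot" height range), which a single correlation coefficient would hide.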

In [None]:
""" Load data from SQLite and JSON and convert to DataFrame """
# Unzip the downloaded archive (the uncompressed file is too large for git)
with zipfile.ZipFile("kaggle.zip", 'r') as zip_ref:
  zip_ref.extractall(".")


print("Loading Historical Height Data")
with open("players.json", "r") as f:
  player_history = pd.DataFrame(json.load(f), columns=["Name", "Year", "Team", "Birth", "Height", "Position"])
# describe must be called; without () only the bound method is printed
print(player_history.describe())
print(player_history.head())

We first convert all column names to lowercase for consistency.

In [None]:
player_history.columns = list(map(str.lower, player_history.columns))
player_history.reset_index(inplace = True, drop = True)
player_history.head()
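With lowercase columns in place, the planned trend analysis (players from 1950 onward, split by position) could look roughly like the sketch below. It runs on made-up rows, not on `players.json`. It assumes `year` is a numeric season year, `height` is in centimetres, and positions are already short labels such as "GK" and "MF"; the real file may need extra parsing first.

```python
import pandas as pd

# Toy rows shaped like `player_history` after lower-casing (made-up values).
toy_history = pd.DataFrame({
    "year":     [1948, 1955, 1958, 1987, 1989, 2015, 2018],
    "height":   [172, 175, 183, 178, 188, 181, 191],
    "position": ["MF", "MF", "GK", "MF", "GK", "MF", "GK"],
})

# Coerce anything non-numeric to NaN, then keep 1950 onward.
toy_history["height"] = pd.to_numeric(toy_history["height"], errors="coerce")
recent = toy_history[toy_history["year"] >= 1950].copy()

# Bucket seasons into decades, then average height per decade and position.
recent["decade"] = (recent["year"] // 10) * 10
trend = recent.groupby(["decade", "position"])["height"].mean().unstack("position")
print(trend)
```

The resulting table has one row per decade and one column per position, which is a convenient shape for a line plot of height over time.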

We remove the columns that are not needed for our analysis.
In the players' position column, the original dataset uses very specific values such as "CAM" or "CDM". To normalize them, we map values containing "B" (Back) to "DF" (defender) and values containing "M" (Midfield) to "MF" (midfielder); values containing "GK" stay "GK" (goalkeeper), and all remaining values become "FW" (forward).

In [None]:
ignore_columns = ['short_name', 'sofifa_id', 'player_url', 'potential', 'value_eur',
 'wage_eur','international_reputation','weak_foot','skill_moves','work_rate','body_type', 'real_face','release_clause_eur',
 'player_tags','team_jersey_number','loaned_from', 'contract_valid_until','nation_jersey_number','pace','shooting',
 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling','gk_kicking', 'gk_reflexes', 'gk_speed',
 'gk_positioning','player_traits', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing',
 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control',
 'movement_acceleration', 'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power',
 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
 'mentality_positioning', 'mentality_vision', 'mentality_penalties', 'mentality_composure', 'defending_marking', 'defending_standing_tackle',
 'defending_sliding_tackle', 'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning',
 'goalkeeping_reflexes', 'ls', 'st', 'rs', 'lw', 'lf', 'cf','rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb',
 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', 'league_rank', 'preferred_foot', 'team_position', 'nation_position']

def clean_position(d):
    if 'B' in d:
        return 'DF'
    elif 'M' in d:
        return 'MF'
    elif 'GK' in d:
        return 'GK'
    else:
        return 'FW'

df = {}
for y in range(2015, 2022):
    # Files are named players_15.csv ... players_21.csv
    df[y] = pd.read_csv(f"players_{y % 100}.csv").drop(columns = ignore_columns)
    df[y]['year'] = y  # tag each row with its FIFA edition year
    df[y]['player_positions'] = df[y]['player_positions'].apply(clean_position)
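Because `clean_position` matches substrings in a fixed order, a player listed under several positions is classified by whichever letter is checked first, not by the position listed first. The sketch below repeats the function so it runs on its own and compares it with `primary_position`, a hypothetical alternative that is not part of this notebook. It assumes multi-position values are comma-separated strings such as "LW, LM".

```python
def clean_position(d):
    """Same substring rule as the cell above, repeated so this runs standalone."""
    if 'B' in d:
        return 'DF'
    elif 'M' in d:
        return 'MF'
    elif 'GK' in d:
        return 'GK'
    else:
        return 'FW'


def primary_position(d):
    """Hypothetical alternative: classify by the first listed position only."""
    first = d.split(',')[0].strip()
    return clean_position(first)


samples = ["CAM", "CDM", "LWB", "GK", "ST", "LW, LM", "RW, RB"]
for s in samples:
    print(f"{s!r}: substring={clean_position(s)}, first-listed={primary_position(s)}")
```

For single-position values the two rules agree. They differ for values like "LW, LM", which the substring rule labels "MF" even though the first listed position is a winger. Which rule suits the analysis better is a judgment call.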


We concatenate the yearly tables into one dataset and rename the columns to clearer names.

In [None]:
df = pd.concat(df.values())
df.columns = list(map(str.lower, df.columns))
df.rename(columns = {'long_name':'name', 'club_name': 'team', 'dob': 'birth', 'weight_kg': 'weight', 'height_cm':'height'}, inplace = True)
df.reset_index(inplace = True, drop = True)
df.head()

In [None]:
players = pd.concat([df, player_history])
players.reset_index(inplace = True, drop = True)
players.head()


| | name | age | birth | height | weight | nationality | team | league_name | overall | player_positions | joined | year | position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lionel Andrés Messi Cuccittini | 27.0 | 1987-06-24 | 169 | 67.0 | Argentina | FC Barcelona | Spain Primera Division | 93.0 | FW | 2004-07-01 | 2015 | |
| 1 | Cristiano Ronaldo dos Santos Aveiro | 29.0 | 1985-02-05 | 185 | 80.0 | Portugal | Real Madrid | Spain Primera Division | 92.0 | MF | 2009-07-01 | 2015 | |
| 2 | Arjen Robben | 30.0 | 1984-01-23 | 180 | 80.0 | Netherlands | FC Bayern München | German 1. Bundesliga | 90.0 | MF | 2009-08-28 | 2015 | |
| 3 | Zlatan Ibrahimović | 32.0 | 1981-10-03 | 195 | 95.0 | Sweden | Paris Saint-Germain | French Ligue 1 | 90.0 | FW | 2012-07-01 | 2015 | |
| 4 | Manuel Neuer | 28.0 | 1986-03-27 | 193 | 92.0 | Germany | FC Bayern München | German 1. Bundesliga | 90.0 | GK | 2011-07-01 | 2015 | |
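In the preview above, the FIFA rows carry their label in `player_positions` while `position` is empty, because the two source tables name that column differently. A minimal sketch of one way to fold them into a single column is shown below, on made-up frames (`fifa`, `history`, and `merged` are illustrative names). It assumes both tables use the same position vocabulary, which would need checking on the real data.

```python
import pandas as pd

# Toy stand-ins for the two tables being concatenated (made-up rows).
fifa = pd.DataFrame({"name": ["A", "B"], "player_positions": ["FW", "GK"]})
history = pd.DataFrame({"name": ["C"], "position": ["DF"]})

merged = pd.concat([fifa, history], ignore_index=True)

# Fill the empty `position` cells from `player_positions`, then drop the spare column.
merged["position"] = merged["position"].fillna(merged["player_positions"])
merged = merged.drop(columns="player_positions")
print(merged)
```

An equivalent option would be to rename `player_positions` to `position` before concatenating, so that only one column is ever created.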
