# Exploratory Data Analysis (EDA)

In this notebook we perform EDA on a dataset of MLS (Major League Soccer) player salaries. We will load the dataset, explore it, clean and manipulate the data, and then visualize it to gain insights.

In [1]:
import pandas as pd
# pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Load the Dataset
With pandas imported, we load the dataset into a DataFrame. The `read_csv` function reads a CSV file, here fetched from a URL, and returns its contents as a DataFrame.

In [2]:
# Load the dataset into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/cbtn-data-science-ml/python-for-data-analysis/main/datasets/mls_salaries.csv")

# Initial Exploration
Now that we have loaded the dataset into a DataFrame, we can start exploring the data. Here are a few basic operations we can perform to get a feel for the data:

* Use the `head()` method to display the first few rows of the DataFrame
* Use the `info()` method to get a summary of the DataFrame, including the data types of each column and the number of non-null values
* Use the `describe()` method to get summary statistics for numerical columns

In [3]:
# Display the first few rows of the DataFrame
df.head()

,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0
1,ATL,Ambrose,Mikey,D,65625.0,65625.0
2,ATL,Asad,Yamil,M,150000.0,150000.0
3,ATL,Bloom,Mark,D,99225.0,106573.89
4,ATL,Carleton,Andrew,F,65000.0,77400.0


In [4]:
# Get a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   club                     614 non-null    object 
 1   last_name                614 non-null    object 
 2   first_name               610 non-null    object 
 3   position                 604 non-null    object 
 4   base_salary              614 non-null    float64
 5   guaranteed_compensation  614 non-null    float64
dtypes: float64(2), object(4)
memory usage: 29.0+ KB
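
The non-null counts above are lower than the number of entries for every column, which means the dataset contains missing values. As a minimal sketch of how to count them directly, the snippet below builds a small stand-in frame (`sample`, a hypothetical name) from two rows of the `head()` output plus one made-up incomplete row, and calls `isna().sum()` on it. On the real data the same call would be `df.isna().sum()`.

```python
import pandas as pd

# Stand-in for df: two rows copied from the head() output above, plus one
# hypothetical row with a missing position, added purely for illustration.
sample = pd.DataFrame({
    "club": ["ATL", "ATL", "ATL"],
    "last_name": ["Almiron", "Ambrose", "Example"],
    "first_name": ["Miguel", "Mikey", "Player"],
    "position": ["M", "D", None],
    "base_salary": [1912500.0, 65625.0, 60000.0],
    "guaranteed_compensation": [2297000.0, 65625.0, 60000.0],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```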


In [5]:
# Get summary statistics for numerical columns
df.describe()

,base_salary,guaranteed_compensation
count,614.0,614.0
mean,297173.0,326375.2
std,672583.9,749121.7
min,52999.92,52999.92
25%,65633.4,70030.35
50%,125000.0,135002.0
75%,255000.0,279875.0
max,6660000.0,7167500.0


In [6]:
# Round the summary statistics to whole numbers
df.describe().round()

,base_salary,guaranteed_compensation
count,614.0,614.0
mean,297173.0,326375.0
std,672584.0,749122.0
min,53000.0,53000.0
25%,65633.0,70030.0
50%,125000.0,135002.0
75%,255000.0,279875.0
max,6660000.0,7167500.0


In [7]:
# Format every statistic as a string with no decimal places
df.describe().apply(lambda s: s.apply('{0:.0f}'.format))

,base_salary,guaranteed_compensation
count,614,614
mean,297173,326375
std,672584,749122
min,53000,53000
25%,65633,70030
50%,125000,135002
75%,255000,279875
max,6660000,7167500
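
The cell above converts every statistic to a string, so the result can no longer be used in calculations. One alternative, in the spirit of the commented-out `pd.set_option` line at the top of the notebook, is to change only how floats are displayed. The sketch below assumes a small stand-in frame (`sample`, a hypothetical name) built from the first two rows of the `head()` output, and uses `pd.option_context` so the format applies only inside the `with` block.

```python
import pandas as pd

# Stand-in for df: salary columns of the first two rows shown by head() above
sample = pd.DataFrame({
    "base_salary": [1912500.0, 65625.0],
    "guaranteed_compensation": [2297000.0, 65625.0],
})
summary = sample.describe()

# Apply the float format only inside this block; the values stay numeric
with pd.option_context("display.float_format", "{:.0f}".format):
    text = summary.to_string()

print(text)
```

Because `summary` still holds floats, it can be rounded, compared, or plotted afterwards, which is not possible once the values have been turned into strings.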


In [13]:
# Extra compensation guaranteed on top of the base salary
df['difference'] = df['guaranteed_compensation'] - df['base_salary']

In [14]:
df

,club,last_name,first_name,position,base_salary,guaranteed_compensation,difference
0,ATL,Almiron,Miguel,M,1912500.0,2297000.00,384500.00
1,ATL,Ambrose,Mikey,D,65625.0,65625.00,0.00
2,ATL,Asad,Yamil,M,150000.0,150000.00,0.00
3,ATL,Bloom,Mark,D,99225.0,106573.89,7348.89
4,ATL,Carleton,Andrew,F,65000.0,77400.00,12400.00
...,...,...,...,...,...,...,...
609,VAN,Techera,Cristian,M,352000.0,377000.00,25000.00
610,VAN,Teibert,Russell,M,126500.0,194000.00,67500.00
611,VAN,Tornaghi,Paolo,GK,80000.0,80000.00,0.00
612,VAN,Waston,Kendall,D,350000.0,368125.00,18125.00
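
Once the `difference` column exists, it can be queried like any other column. As a rough sketch, the snippet below rebuilds the column on a stand-in frame (`sample`, a hypothetical name) containing the five rows shown by `head()`, then finds the largest gaps and the share of players with no extra guaranteed compensation. The variable names `top` and `share_no_extra` are illustrative only.

```python
import pandas as pd

# Stand-in for df: the five rows shown by head() above
sample = pd.DataFrame({
    "last_name": ["Almiron", "Ambrose", "Asad", "Bloom", "Carleton"],
    "base_salary": [1912500.0, 65625.0, 150000.0, 99225.0, 65000.0],
    "guaranteed_compensation": [2297000.0, 65625.0, 150000.0, 106573.89, 77400.0],
})
sample["difference"] = sample["guaranteed_compensation"] - sample["base_salary"]

# The two players with the largest gap between guaranteed and base pay
top = sample.nlargest(2, "difference")

# Share of players whose guaranteed compensation equals their base salary
share_no_extra = (sample["difference"] == 0).mean()

print(top[["last_name", "difference"]])
print(share_no_extra)
```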


In [11]:
# Drop every row that has at least one missing value, modifying df in place
df.dropna(how="any", inplace=True)
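
With `how="any"`, a row is removed as soon as a single column is missing, which can discard more rows than intended. The sketch below illustrates this on a small stand-in frame (`sample`, a hypothetical name) made of two rows from the `head()` output and two made-up incomplete rows. It compares `how="any"` with the `subset` parameter, and returns new frames instead of using `inplace=True` so the original stays available for comparison.

```python
import pandas as pd

# Stand-in for df: two rows from the head() output above, plus two
# hypothetical incomplete rows added purely for illustration.
sample = pd.DataFrame({
    "last_name": ["Almiron", "Ambrose", "Example", "Sample"],
    "first_name": ["Miguel", "Mikey", "Player", None],
    "position": ["M", "D", None, "F"],
    "base_salary": [1912500.0, 65625.0, 60000.0, 70000.0],
})

# Drop rows with a missing value in any column
complete_rows = sample.dropna(how="any")

# Drop rows only when the position is missing
with_position = sample.dropna(subset=["position"])

print(len(sample), len(complete_rows), len(with_position))
```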

In [12]:
df

,club,last_name,first_name,position,base_salary,guaranteed_compensation,difference
0,ATL,Almiron,Miguel,M,1912500.0,2297000.00,384500.00
1,ATL,Ambrose,Mikey,D,65625.0,65625.00,0.00
2,ATL,Asad,Yamil,M,150000.0,150000.00,0.00
3,ATL,Bloom,Mark,D,99225.0,106573.89,7348.89
4,ATL,Carleton,Andrew,F,65000.0,77400.00,12400.00
...,...,...,...,...,...,...,...
609,VAN,Techera,Cristian,M,352000.0,377000.00,25000.00
610,VAN,Teibert,Russell,M,126500.0,194000.00,67500.00
611,VAN,Tornaghi,Paolo,GK,80000.0,80000.00,0.00
612,VAN,Waston,Kendall,D,350000.0,368125.00,18125.00
