# What is Pandas?

pandas is a Python library for data manipulation and analysis. It provides data structures for efficiently storing large datasets, along with tools for reshaping, cleaning, and summarizing them. The main features of pandas that are useful for exploratory data analysis (EDA) include:

* Reading and writing data: pandas provides functions for reading data from, and writing data to, a variety of formats, including CSV, Excel, and SQL databases.
* Handling missing values: pandas provides functions for identifying and handling missing values in a dataset.
* Data cleaning: pandas provides functions for cleaning and formatting data, such as converting data types and removing duplicates.
* Data visualization: pandas integrates with the Matplotlib library for data visualization, allowing you to create a variety of plots and charts to visualize your data.
* Data aggregation and grouping: pandas provides functions for performing aggregation and grouping operations on your data, such as calculating the mean or sum of a group of values.
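
To make the features listed above concrete, here is a minimal sketch on a tiny in-memory table. The column names (`team`, `player`, `salary`) and values are hypothetical and are not taken from the dataset used later in this notebook.

```python
import io
import pandas as pd

# Hypothetical toy data, not the dataset used later in this notebook
csv_text = """team,player,salary
A,Ann,100
A,Bob,
B,Cy,300
B,Cy,300
"""

# Reading data: read_csv accepts a path, a URL, or a file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# Handling missing values: count the NaN entries in each column
missing_per_column = toy.isna().sum()

# Data cleaning: drop exact duplicate rows, then fill the missing salary with 0
cleaned = toy.drop_duplicates().fillna({"salary": 0})

# Aggregation and grouping: mean salary per team
mean_salary = cleaned.groupby("team")["salary"].mean()

print(missing_per_column)
print(mean_salary)
```

Filling the missing salary with 0 is only one possible choice; dropping the row with `dropna` would be equally valid depending on the analysis.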

In [12]:
import pandas as pd

# Load the Dataset
With pandas imported, we can load the dataset into a pandas DataFrame. The read_csv function reads a CSV file, given here as a URL, and returns its contents as a DataFrame.

In [10]:
# Load the dataset into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/cbtn-data-science-ml/python-for-data-analysis/main/datasets/mls_salaries.csv")
df

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.00
1,ATL,Ambrose,Mikey,D,65625.0,65625.00
2,ATL,Asad,Yamil,M,150000.0,150000.00
3,ATL,Bloom,Mark,D,99225.0,106573.89
4,ATL,Carleton,Andrew,F,65000.0,77400.00
...,...,...,...,...,...,...
610,VAN,Teibert,Russell,M,126500.0,194000.00
611,VAN,Tornaghi,Paolo,GK,80000.0,80000.00
612,VAN,Waston,Kendall,D,350000.0,368125.00
613,,,,,,


In [14]:
# Load the same dataset from a local Excel file (reading .xlsx files requires the openpyxl package)
df_excel = pd.read_excel('mls_salaries.xlsx')
df_excel

Unnamed: 0.1,Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,0,ATL,Almiron,Miguel,M,1912500.0,2297000.00
1,1,ATL,Ambrose,Mikey,D,65625.0,65625.00
2,2,ATL,Asad,Yamil,M,150000.0,150000.00
3,3,ATL,Bloom,Mark,D,99225.0,106573.89
4,4,ATL,Carleton,Andrew,F,65000.0,77400.00
...,...,...,...,...,...,...,...
610,610,VAN,Teibert,Russell,M,126500.0,194000.00
611,611,VAN,Tornaghi,Paolo,GK,80000.0,80000.00
612,612,VAN,Waston,Kendall,D,350000.0,368125.00
613,613,,,,,,
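
The Excel version shows an extra `Unnamed: 0` column. A common cause is a file that was saved together with its row index, although whether that happened here is an assumption. The sketch below reproduces the effect with a CSV round trip on a hypothetical two-row frame and shows how `index_col=0` avoids it; `read_excel` accepts the same `index_col` parameter.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the salary data
original = pd.DataFrame(
    {"club": ["ATL", "VAN"], "base_salary": [65625.0, 80000.0]}
)

# to_csv writes the index as an unnamed first column by default
buffer = io.StringIO()
original.to_csv(buffer)

# Reading the file back naively turns that column into "Unnamed: 0"
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 tells pandas to use the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` when writing the file is the other way to keep the stray column from appearing in the first place.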


# The Pandas DataFrame
Let's continue with the DataFrame loaded by read_csv, since CSV is the more commonly used of the two formats.

In [17]:
df

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.00
1,ATL,Ambrose,Mikey,D,65625.0,65625.00
2,ATL,Asad,Yamil,M,150000.0,150000.00
3,ATL,Bloom,Mark,D,99225.0,106573.89
4,ATL,Carleton,Andrew,F,65000.0,77400.00
...,...,...,...,...,...,...
610,VAN,Teibert,Russell,M,126500.0,194000.00
611,VAN,Tornaghi,Paolo,GK,80000.0,80000.00
612,VAN,Waston,Kendall,D,350000.0,368125.00
613,,,,,,


In [18]:
type(df)

pandas.core.frame.DataFrame

In [None]:
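# Build a small DataFrame from a dictionary: keys become column names, lists become column values.
# A separate name is used so the salary data in df is not overwritten.
small_df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
small_df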
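
A DataFrame built from a dictionary, as in the cell above, can be inspected with a few basic attributes. This is a small sketch using the same toy values under the hypothetical name `toy`.

```python
import pandas as pd

# Same toy values as the cell above, under a hypothetical name
toy = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels come from the dictionary keys
print(list(toy.index))    # default integer index starting at 0
```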

# The Pandas Series

In [15]:
df['club']

0      ATL
1      ATL
2      ATL
3      ATL
4      ATL
      ... 
610    VAN
611    VAN
612    VAN
613    NaN
614    VAN
Name: club, Length: 615, dtype: object

In [16]:
type(df['club'])

pandas.core.series.Series
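
Selecting a single column yields a Series, which carries its own name, dtype, and methods. The sketch below uses a hypothetical three-row frame rather than the full salary data.

```python
import pandas as pd

# Hypothetical three-row frame; values are illustrative only
toy = pd.DataFrame(
    {"club": ["ATL", "ATL", "VAN"], "base_salary": [65625.0, 150000.0, 80000.0]}
)

clubs = toy["club"]

print(type(clubs).__name__)  # class of the selected column
print(clubs.name)            # the Series keeps the column label as its name
print(clubs.value_counts())  # number of rows per club
```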

## Data Attributes

In [19]:
df['club'].shape

(615,)

In [20]:
df[['club', 'last_name']]

Unnamed: 0,club,last_name
0,ATL,Almiron
1,ATL,Ambrose
2,ATL,Asad
3,ATL,Bloom
4,ATL,Carleton
...,...,...
610,VAN,Teibert
611,VAN,Tornaghi
612,VAN,Waston
613,,


In [21]:
df[['club', 'last_name']].shape

(615, 2)

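The selection returned a DataFrame with 615 rows and 2 columns. Remember, a DataFrame is 2-dimensional, with both a row and a column dimension, whereas a Series such as df['club'] is 1-dimensional, which is why its shape has a single entry.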
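
The difference between single and double brackets can be checked directly. In this sketch on a hypothetical three-row frame, single brackets return a 1-dimensional Series, while a list of column names returns a 2-dimensional DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical three-row frame; values are illustrative only
toy = pd.DataFrame(
    {
        "club": ["ATL", "LA", "VAN"],
        "last_name": ["Ann", "Bob", "Cy"],
        "position": ["M", "F", "D"],
    }
)

as_series = toy["club"]   # single brackets: a Series
as_frame = toy[["club"]]  # list of names: a DataFrame with one column

print(as_series.shape, as_series.ndim)
print(as_frame.shape, as_frame.ndim)
```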

## Filtering

In [22]:
# Boolean mask: True for every row whose base salary exceeds 1,000,000
millionaire_club = df['base_salary'] > 1000000

In [27]:
df[millionaire_club]

Unnamed: 0,club,last_name,first_name,position,base_salary,guaranteed_compensation
0,ATL,Almiron,Miguel,M,1912500.0,2297000.0
52,CHI,Nikolic,Nemanja,F,1700000.04,1906333.37
55,CHI,Schweinsteiger,Bastian,M,5400000.0,5400000.0
67,CLB,Higuain,Federico,M,1050000.0,1050000.0
98,COL,Gashi,Shkelzen,F,1575000.0,1668750.0
103,COL,Howard,Tim,GK,2000000.0,2475000.0
224,LA,Alessandrini,Romain,M,1669400.64,1999400.64
230,LA,Dos Santos,Giovani,F,3750000.0,5500000.0
347,NYCFC,Moralez,Maximiliano,M,2000000.04,2000000.04
349,NYCFC,Pirlo,Andrea,M,5600000.0,5915690.0
