# 1: Introduction to Pandas

In this lesson, we will explore how to manipulate and analyze data using Pandas. Pandas excels in data cleaning, preparation, exploration, and transformation, making it an indispensable library for data professionals.

We'll use a dataset of baseball player salaries, which includes columns for Name, Team, League, Position, and Salary. You will learn how to perform essential operations on DataFrames, such as importing data, displaying data, sorting, filtering, running functions, and using group by.

### Helpful links

- [10 minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Pandas tutorials, quizzes and exercises on W3Schools](https://www.w3schools.com/python/pandas/default.asp)

## 1.1 Importing data from a CSV and creating a dataframe

First, we need to import our dataset. You can import from various places, including your computer, cloud storage, or GitHub (via a URL). We will import a CSV file that is stored on GitHub at this URL:

https://raw.githubusercontent.com/dnmalan/advanced-data-journalism-23/main/data/baseball_players.csv

Pandas uses objects called **dataframes**. A dataframe is a two-dimensional tabular data structure that organizes data into rows and columns.
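
To make the idea concrete, here is a minimal sketch that builds a tiny dataframe by hand. The three rows are made-up values for illustration, not rows from the lesson's dataset, and the variable name `sample` is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the lesson's CSV
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Kansas City Royals", "Kansas City Royals", "Boston Red Sox"],
    "Salary": [555000, 1200000, 20500000],
})

# Each dictionary key becomes a column; each list position becomes a row
print(sample)
print(list(sample.columns))
print(len(sample))
```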


In [None]:
# Import the libraries needed into your notebook
import pandas as pd


In [None]:
# Create a dataframe called "df", and import a csv file into that dataframe
df = pd.read_csv('https://raw.githubusercontent.com/dnmalan/advanced-data-journalism-23/main/data/baseball_players.csv')

# Display the first few rows of the DataFrame
df.head()

Pandas generally follows this format to run functions:

- **dataframe.function()**

This is what's called a Python **method**, or a function that is associated with an object (the dataframe) and can perform actions on it.
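
As a small sketch of how this looks in practice, the example below uses a made-up two-row dataframe (the data and variable names are assumptions for illustration). A method such as `head()` is called with parentheses, while an attribute such as `shape` is read without them.

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Salary": [555000, 1200000],
})

first_row = sample.head(1)  # method: parentheses, with an optional argument
dimensions = sample.shape   # attribute: no parentheses

print(first_row)
print(dimensions)  # a (rows, columns) tuple
```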

## 1.2 Exploring the dataset
Before doing any analysis, we'll need to explore and get to know the dataset. Let's inspect the first and last few rows, as well as get some initial statistics about our dataset.

In [None]:
# Display the first 5 rows
df.head()

In [None]:
# Display the last 5 rows
df.tail()


You can also look at any number of rows by adding a number within the parentheses.

In [None]:
#display first 10 rows
df.head(10)

If you want to see just one column, use this syntax:

- **df['column']**

In [None]:
# Display only the player name column
df['Name']

In [None]:
# On your own: Display only the team name column




You can also print the entire dataset to see the beginning and end.

In [None]:
#print out the entire dataframe
print(df)

You can also get some basic summaries very quickly with the **shape** attribute (no parentheses) and the **describe** method.

In [None]:
# Find how many rows and columns in the dataset
df.shape

In [None]:
# Summary statistics for numeric columns
df.describe()

In [None]:
#add this formatting snippet to turn off scientific notation

df.describe().apply(lambda s: s.apply('{0:.2f}'.format))
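
One possible alternative, sketched here on made-up salaries rather than the lesson's dataset, is to set the Pandas display option `display.float_format` for a single block with `pd.option_context`. Unlike the snippet above, this keeps the summary values numeric and only changes how they are displayed.

```python
import pandas as pd

# Made-up salaries for illustration only
sample = pd.DataFrame({"Salary": [555000, 1200000, 20500000]})

# Apply a two-decimal float format only inside this block
with pd.option_context("display.float_format", "{:.2f}".format):
    summary_text = sample.describe().to_string()

print(summary_text)
```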

## 1.3 Analyzing the dataset

Now that we've got a feel for what's in the dataset, let's do some analysis, including sorting, filtering and functions.

### Sorting

Let's find who makes the highest salary (and the lowest salary) by sorting this dataset. We will use the **sort_values** method with the parameters **by** (to tell Pandas which column to sort by) and **ascending** (to tell Pandas which order to sort in).

In [None]:
#sort by salaries in descending order (largest salary at the top)

df.sort_values(by='Salary', ascending=False)


In [None]:
#sort by salaries in ascending order (smallest salary at the top)

df.sort_values(by='Salary', ascending=True)
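
Note that **sort_values** returns a new, sorted dataframe and leaves the original untouched. The sketch below shows this on made-up data (the rows and variable names are assumptions for illustration).

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": [1200000, 555000, 20500000],
})

# Save the sorted result under a new name
highest_first = sample.sort_values(by="Salary", ascending=False)

print(highest_first)  # largest salary at the top
print(sample)         # original row order is unchanged
```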

In [None]:
# On your own: Sort by team, in both ascending and descending order





### Filtering

Filtering is the way to find a subset of the dataframe that matches a specific criterion or set of criteria. Think of it as searching your data: the filter returns the rows that match. You can search within text or numerical columns, and you can search within one column or multiple columns at once.

To filter in Pandas, you'll need several pieces:

- Column: The column you want to search in.
- Comparison operator: equal to (==), not equal to (!=), greater than (>), less than (<), greater than or equal to (>=), and less than or equal to (<=)
- Criteria: The information you want to search by.

The filter generally follows this pattern:

- **df[df['column'] == 'value']**
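
To see why the pattern works, it can help to split it into two steps. In this sketch on made-up data (rows and variable names are assumptions for illustration), the inner comparison produces a column of True/False values, and the outer brackets keep only the rows marked True.

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Kansas City Royals", "Kansas City Royals", "Boston Red Sox"],
    "Salary": [555000, 1200000, 20500000],
})

# Step 1: the comparison returns one True/False value per row
mask = sample["Team"] == "Kansas City Royals"
print(mask)

# Step 2: passing that mask to the dataframe keeps only the True rows
royals_sample = sample[mask]
print(royals_sample)
```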

#### Filtering by numbers

Do not put quotes around the numbers you're filtering by.

In [None]:
#Find players who make the minimum wage

df[df['Salary'] == 555000]

In [None]:
#Find players who make more than $10 million
df[df['Salary'] > 10000000]

#### Filtering by text

Put quotes around the text value you're filtering by.

In [None]:
#find all the Kansas City Royals players
df[df['Team'] == 'Kansas City Royals']


#### Filtering by multiple columns

Now let's put these together and search for Kansas City Royals players who make more than $1 million.

To filter by multiple columns, we need to wrap each criterion in parentheses and add a logical operator (AND, OR, or NOT) in between the criteria.

- AND: &
- OR: |
- NOT: ~
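
As a brief sketch of the OR and NOT operators, the example below filters a made-up dataframe (the rows and variable names are assumptions for illustration). Each condition sits inside its own parentheses.

```python
import pandas as pd

# Made-up data for illustration only
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Kansas City Royals", "Kansas City Royals",
             "Boston Red Sox", "New York Yankees"],
    "Salary": [555000, 1200000, 20500000, 900000],
})

# OR: players who make more than $10 million or play for the Royals
either = sample[(sample["Salary"] > 10000000) | (sample["Team"] == "Kansas City Royals")]

# NOT: players who do not play for the Royals
not_royals = sample[~(sample["Team"] == "Kansas City Royals")]

print(either)
print(not_royals)
```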


In [None]:
#Filter by Kansas City Royals and salary of more than 1 million

df[(df['Salary'] > 1000000) & (df['Team'] == 'Kansas City Royals')]

#### Creating a new dataframe from results

Often, we'll want to save the results of a filter or function into a new dataframe so we can do more analysis on it. To do this, choose a new variable name and assign the result to it.

In [None]:
# Create a new dataframe called "royals" that holds the filtered rows of the original dataframe
royals = df[df['Team'] == 'Kansas City Royals']


In [None]:
royals.head()

In [None]:
royals.shape

In [None]:
# On your own: Find all of the players who make more than $20 million


