# Introduction to Pandas

<img src="https://miro.medium.com/max/1400/1*6d5dw6dPhy4vBp2vRW6uzw.png" width = 500>

## Learning Objectives

**After this lesson, you will be able to:**
- Define what pandas is and how it relates to data analysis/data science.
- Manipulate pandas dataframes and series.
- Read in csv data from a variety of places.
- Filter and sort data using pandas.
- Manipulate dataframe columns.
- Know how to handle null and missing values.

## What are Python libraries?

In Lesson 1 we learned to create functions: blocks of code that you expect to reuse. If you think other people would find your code useful too, you can package it up so that they can install and use it. These packages are called libraries.

## What is pandas?

- `pandas` is a data analysis library for Python
- the name is derived from "panel data"

It helps with data analysis because it:
- is fast
- has lots of useful functionality
- is backed by a large open source community

We already installed pandas as part of the installation instructions. Now we need to import it into this notebook:
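By convention, pandas is imported under the alias `pd`. A minimal import cell might look like this (printing the version is just an optional check that the import worked):

```python
import pandas as pd

# Optional sanity check: show which version of pandas is installed
print(pd.__version__)
```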

Every Python library will have documentation online:

https://pandas.pydata.org/docs/reference/index.html

I generally find that Google has better search functionality than the internal search on the website.

Have a look at the different fields in this example: 

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby

## A few quick examples of things you can do in pandas

We can also merge multiple datasets.
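As a rough illustration of merging, the sketch below joins two small tables on a shared column. The table contents and the column names (`region_code`, `patients`, `region_name`) are made up for this example and are not taken from the lesson's datasets.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
patients = pd.DataFrame({"region_code": ["A", "B", "C"], "patients": [120, 85, 40]})
regions = pd.DataFrame({"region_code": ["A", "B"], "region_name": ["North", "South"]})

# Left join: keep every row of `patients` and add matching columns from `regions`.
# Region "C" has no match, so its region_name is filled with a missing value.
merged = patients.merge(regions, on="region_code", how="left")
print(merged)
```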

CCG Info: https://opendata.arcgis.com/api/v3/datasets/bfb87228cf9e4c44bad0cffa353d0fc8_0/downloads/data?format=csv&spatialRefId=4326

```python
ccg_url = 'https://opendata.arcgis.com/api/v3/datasets/bfb87228cf9e4c44bad0cffa353d0fc8_0/downloads/data?format=csv&spatialRefId=4326'
```

## Let's start with the building blocks of pandas

Pandas introduces two important new data types: `Series` and `DataFrame`.

`Series`

A `Series` is a sequence of items, where each item has a label (the labels together form the `index`, and they are usually unique). At first it might seem the same as a list, but there are a few differences:
- They can have custom indices like a dictionary
- They allow vectorised operations
- They are the building blocks of dataframes
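The first two differences can be sketched with a small example. The fruit names and prices here are made up for illustration.

```python
import pandas as pd

# A Series with a custom (string) index, a bit like a dictionary
prices = pd.Series([1.50, 0.25, 3.00], index=["apple", "banana", "cherry"])

# Look up a value by its label
print(prices["banana"])

# Vectorised operation: every value is doubled at once, no loop needed
doubled = prices * 2
print(doubled)
```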

`DataFrame`

A `DataFrame` is a table of data. It is a collection of pandas series. Each row has a unique label (the `row index`), and each column has a unique label (the `column index`).

To review: A pandas series is a list of data with an associated index, and a pandas dataframe is a collection of series with a shared index.
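To make the relationship concrete, here is a small sketch that builds a dataframe out of two series sharing the same index. The names and values are invented for the example.

```python
import pandas as pd

# Two Series that share the same index labels
ages = pd.Series([34, 29], index=["alice", "bob"])
cities = pd.Series(["Leeds", "Bristol"], index=["alice", "bob"])

# Combine them into a DataFrame: each Series becomes a column
people = pd.DataFrame({"age": ages, "city": cities})
print(people)

# Pulling a single column back out gives a Series again
print(type(people["age"]))
```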

### Reading Files, Selecting Columns, and Summarizing

To read in a file, we use the function `read_csv` (or `read_excel`) and pass it the path to the file. As you saw above, we can also pass a URL.
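Here is a minimal sketch of `read_csv`. To keep it runnable without any files, it reads from an in-memory string via `io.StringIO`; the CSV contents and the commented-out file path are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file; the contents are invented for this example
csv_text = "title,artist,popularity\nSong A,Artist X,80\nSong B,Artist Y,65\n"

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
songs = pd.read_csv(io.StringIO(csv_text))
print(songs)

# With a real file or URL you would pass the location as a string instead, e.g.
# songs = pd.read_csv("data/my_file.csv")  # hypothetical path
```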

There are a few in-built methods that can help us to get a feel for the data:
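A few that are commonly used for a first look are `head`, `shape`, `info` and `describe`. The sketch below runs them on a small made-up dataframe; this selection is a suggestion rather than the full list.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C", "Song D"],
    "popularity": [80, 65, 90, 75],
})

# First rows of the table (5 by default, here we ask for 2)
print(df.head(2))

# Number of rows and columns, as a tuple
print(df.shape)

# Column names, data types and non-null counts (prints directly)
df.info()

# Summary statistics for the numeric columns
summary = df.describe()
print(summary)
```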

If we want to look at individual columns, we use square brackets.
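As a small sketch with made-up data: a single column name in square brackets returns a series, while a list of column names returns a smaller dataframe.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "title": ["Song A", "Song B"],
    "artist": ["Artist X", "Artist Y"],
    "year": [2015, 2018],
})

# One column name gives a Series
titles = df["title"]
print(titles)

# A list of column names (note the double brackets) gives a DataFrame
subset = df[["title", "year"]]
print(subset)
```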

## Exercise

1) Read in the data from this dataset: https://raw.githubusercontent.com/carnall-farrar/python_club/master/data/top_50_songs.csv

2) Print a list of all genres 

3) Find the average popularity score

## Filtering and sorting

Pandas makes it very easy to filter the data, but it's good to understand what's happening under the hood when we do this.

Performing a simple comparison gives us a series of boolean values.
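For instance, comparing a column to a number produces one `True`/`False` value per row. The data below is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"name": ["Ann", "Ben", "Cat"], "score": [55, 72, 90]})

# The comparison is applied to every row and returns a boolean Series
mask = df["score"] > 70
print(mask)
```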

We then use that boolean series to pull out the rows that we want. We do this with the `loc` indexer:

The syntax is:

`my_dataframe.loc[<filter_condition>, <column>]`

We can use this both to filter the data and to edit the values for a subset of the data.
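Both uses can be sketched on a small made-up dataframe: first selecting a column for the rows that match a condition, then assigning a new value to those same rows.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cat"],
    "score": [55, 72, 90],
    "grade": ["", "", ""],
})

# Filter: names of the rows where score is above 70
high = df.loc[df["score"] > 70, "name"]
print(high)

# Edit: set the grade column, but only for the matching rows
df.loc[df["score"] > 70, "grade"] = "pass"
print(df)
```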

There is also an indexer called `iloc`, which selects rows and columns by integer position rather than by label or condition.
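A short sketch of the difference, using a made-up dataframe with string row labels: `loc` looks rows up by label, `iloc` by position (counting from 0).

```python
import pandas as pd

# Made-up data with string labels as the row index
df = pd.DataFrame({"score": [55, 72, 90]}, index=["ann", "ben", "cat"])

# By label
ann_score = df.loc["ann", "score"]

# By position: first row, first column
first_score = df.iloc[0, 0]

# Position slices work like list slices: rows 0 and 1, column 0
first_two = df.iloc[0:2, 0]
print(ann_score, first_score)
print(first_two)
```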

We can use as many conditions as we like!
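Conditions are combined with `&` (and) and `|` (or), and each condition needs its own parentheses. The sketch below uses made-up data and column names.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "region": ["north", "north", "south", "north"],
    "population": [800, 200, 100, 260],
})

# Both conditions must hold (and)
big_north = df.loc[(df["region"] == "north") & (df["population"] > 250)]
print(big_north)

# At least one condition must hold (or)
south_or_small = df.loc[(df["region"] == "south") | (df["population"] < 150)]
print(south_or_small)
```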

## Exercise

1) Using the song data from the previous exercise, filter to just songs by Rihanna

2) Filter to pop songs with a Popularity of over 80

## Renaming and removing columns

We can rename specific columns using a dictionary.
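The dictionary maps old names to new names, and any column not mentioned is left alone. A minimal sketch, with made-up column names:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Track.Name": ["Song A"], "Artist.Name": ["Artist X"]})

# Keys are the current names, values are the new names
renamed = df.rename(columns={"Track.Name": "title"})
print(renamed.columns)
```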

We can also rename all the columns in one go with a list.
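One way to do this is to assign a list to the `columns` attribute. The list must contain exactly one name per column, in order. The data here is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Track.Name": ["Song A"], "Artist.Name": ["Artist X"]})

# Replace every column name at once; the order must match the existing columns
df.columns = ["title", "artist"]
print(df.columns)
```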

To remove columns, we can either use `drop` or select the subset of columns we want to keep.
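Both approaches are sketched below on a made-up dataframe. They give the same result here; which one reads better usually depends on whether it is shorter to list the columns to remove or the columns to keep.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "title": ["Song A", "Song B"],
    "artist": ["Artist X", "Artist Y"],
    "year": [2015, 2018],
})

# Option 1: name the columns to remove
without_year = df.drop(columns=["year"])

# Option 2: name the columns to keep
kept = df[["title", "artist"]]

print(without_year)
print(kept)
```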

Note the difference between performing the operation in place, which changes the existing dataframe, and reassigning the result to the variable.
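The sketch below walks through the three cases with `drop` on made-up data: calling it without keeping the result, reassigning the result, and using `inplace=True`.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"title": ["Song A"], "artist": ["Artist X"], "year": [2015]})

# Without reassignment the result is thrown away and df is unchanged
df.drop(columns=["year"])
columns_without_reassigning = list(df.columns)
print(columns_without_reassigning)

# Reassigning stores the new dataframe under the same name
df = df.drop(columns=["year"])
print(list(df.columns))

# inplace=True modifies df directly and returns None
result = df.drop(columns=["artist"], inplace=True)
print(list(df.columns), result)
```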

## Handling missing data

We can locate rows with missing data using `isnull()`, or find the rows without missing data by negating it with `~`.
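A small sketch on made-up data, where one artist name is missing:

```python
import pandas as pd

# Made-up data; None marks a missing artist name
df = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "artist": ["Artist X", None, "Artist Z"],
})

# Boolean Series: True where the artist is missing
is_missing = df["artist"].isnull()

# Rows with a missing artist
missing = df.loc[is_missing]

# ~ flips True and False, giving the rows where the artist is present
present = df.loc[~is_missing]

print(missing)
print(present)
```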

If we want to remove missing data we can use `dropna()`. Let's take a look at the documentation:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

Make sure this is actually what you want to do!
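To see why it is worth checking, the sketch below compares the default `dropna()`, which removes a row if any value in it is missing, with the `subset` parameter, which only looks at the columns we name. The data is made up for illustration.

```python
import pandas as pd

# Made-up data with missing values in two different columns
df = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "artist": ["Artist X", None, "Artist Z"],
    "popularity": [80, 65, None],
})

# Default: drop every row that has a missing value in any column
no_missing_anywhere = df.dropna()
print(len(no_missing_anywhere))

# Only drop rows where the artist is missing
has_artist = df.dropna(subset=["artist"])
print(len(has_artist))

# The original dataframe is not changed by either call
print(len(df))
```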

## Saving to csv

The method `to_csv` will save your data as a csv file, which you can then import or open in Excel. We specify the location of the file in the same way as when we read in data.

I usually include `index=False` if my data has no meaningful index.
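A minimal sketch of saving and reading back. It writes to a temporary folder so that it runs anywhere; the file name `songs.csv` and the data are made up, and in practice you would pass your own path.

```python
import os
import tempfile

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"title": ["Song A", "Song B"], "popularity": [80, 65]})

# Hypothetical output location inside a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), "songs.csv")

# index=False leaves the row index out of the file
df.to_csv(out_path, index=False)

# Reading the file back shows only the original columns,
# with no extra unnamed index column
round_trip = pd.read_csv(out_path)
print(round_trip)
```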

## Exercise

1) Rename the title column to song name

2) Filter out all the rows where the artist name is missing

3) Save the filtered data to csv - please put it in the same data folder as the data we read in at the start of the class