<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/2023/november_06_intro_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with the Python logo and the words Data Club on it](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/data-club-python.png)

## Introduction to Pandas

Nov. 6, 2023

by [Roger Creel](https://rogercreel.com) for the [Columbia Data Club](https://github.com/columbia-data-club/), modified from notebooks by Isha Shah and others from past Data Club sessions

This notebook underpins a ~60-75 minute presentation that introduces Pandas to complete beginners to Python and to programming.


# **Intro to Pandas**
*Presented by Columbia University Libraries*
***

Welcome to the Columbia University Libraries' Intro to Pandas course! These are our objectives:

* Understanding what it *means* to use data
* How to think critically and work responsibly with imperfect data
* Fundamentals of data wrangling using Pandas
* Awareness of Python's statistical and visualization capabilities
* Awareness of available datasets and where to find them
* An insatiable desire to learn more about using data!






## **Principles of data analysis**
***
No matter the context in which you're using data, there will **always** be a few principles you must follow. At the end of the day, you *are* doing science - you are using empirical observations to test hypotheses (and occasionally, to predict the future based on these hypotheses). Therefore, it is important to follow the same principles that guide scientific inquiry. These are in use far beyond academia. Regardless of where you work, whether a tech firm, bank, non-profit, or research institution, your analytical work **must** be:
1. Well-documented
2. Conscientious in reducing bias (the same way we randomize trials and only collect the data we need to answer a specific question, we have to make sure our data are clean, examined for potential bias, and are suited to the question we want to answer)
3. Reproducible (commenting and code sharing are crucial)
4. Responsibly and clearly communicated (you've done all the work, and now it's important to get it out in the world! Communicating the results of data analyses can be very difficult, especially to folks who don't have a background in it. It is your responsibility to do the best you can in stating what your research can and can't answer, and to make sure that any communication about the data comes from the data - don't make unfounded leaps, or allow others to)

Throughout this guide, you will see a subtitle for each section that shows you the *tool* we are using to perform each part of data analysis. This is to emphasize the greater importance of the *general concepts and principles* of data analysis that are constant across any language or context, rather than the tools used.

## **Getting started**
### *(with Google Colab)*

Topics to be covered:
1. What is Python?
2. Why does it matter?
3. How can you use Python? (IDEs, notebooks, terminal, Colab)
4. What are packages and why do we need them?
5. Basic familiarity with Colab (shareability, power)
6. Pitfalls of using Colab
7. Why pandas? Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


## **Data import and quality checking**
### *(with Pandas)*
***
Topics to be covered, all demoed using the NYC taxi dataset:

1. Importing CSV, TXT, and Excel files
2. Looking at an extract of the dataset (`.head`), assumptions about the data, and the data type for different fields + characteristics of each type
3. Data quality checking
4. The most common ways in which data are imperfect (missing data, duplication, truncation, misleading names)
5. Larger questions you should ask about your data (how was this collected? Is this helpful in answering the question I already have, or should I come up with new questions I can ask this particular data with confidence?)
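
As a rough sketch of what the quality checks in this list can look like in code, here are three quick tests on a tiny made-up table. The column names mirror the taxi data, but the values are invented for illustration; they are not from the TLC file.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing value, one duplicated row,
# and one implausible fare
toy = pd.DataFrame({
    "pickup": ["2021-01-01 00:30", "2021-01-01 00:30", "2021-01-02 08:15", "2021-01-03 17:45"],
    "passengers": [1, 1, None, 2],
    "fare": [8.0, 8.0, 12.5, -3.0],
})

missing_per_column = toy.isna().sum()     # number of missing values in each column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row
negative_fares = (toy["fare"] < 0).sum()  # values outside a plausible range

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("negative fares:", negative_fares)
```

None of these checks tells you *why* the data is imperfect, which is where the larger questions about how the data was collected come in.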


In [None]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels as sm
import requests
import pyarrow



In [None]:
# Run following so that we can see all outputs, not just last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

We'll first download a month of data from the New York Taxi & Limousine Commission (TLC) [Trip Record Data website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
# Get the URL of one month of yellow taxi trip data from the NYC TLC
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

r = requests.get(url)  # create HTTP response object
r.raise_for_status()   # stop here if the download failed
with open("yellow_tripdata_2021-01.parquet", "wb") as f:
    f.write(r.content)  # save the downloaded bytes to a local file

# open parquet file
df = pd.read_parquet('yellow_tripdata_2021-01.parquet')
df.head()

We've saved the file locally in the format TLC gave it to us: an Apache `parquet` file.  

This is an efficient file format, but you'll encounter the `csv` file type more often (`csv` stands for **C**omma **S**eparated **V**alues), so let's learn how to save the dataframe we just loaded as a `csv` file.

In [None]:
df.to_csv('yellow_tripdata_2021-01.csv', index=False)
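
One thing to keep in mind: a `csv` file is plain text, so it does not record column types. The sketch below uses a small made-up dataframe and an in-memory text buffer (`io.StringIO`, standing in for a file on disk) to show that dates come back as plain text unless we ask pandas to parse them with `parse_dates`.

```python
import io
import pandas as pd

# Toy dataframe (hypothetical values) with a real datetime column
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2021-01-01 00:30:00", "2021-01-02 08:15:00"]),
    "fare": [8.0, 12.5],
})

csv_text = toy.to_csv(index=False)  # the CSV is just text; dtypes are not stored

# Reading it back without help: the dates are loaded as text
plain = pd.read_csv(io.StringIO(csv_text))

# Reading it back with parse_dates: the dates are datetimes again
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["pickup"])

print(plain.dtypes)
print(parsed.dtypes)
```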

Now let's load the file as you typically might load a file in pandas, with the `read_csv` command:

In [None]:
taxi = pd.read_csv('yellow_tripdata_2021-01.csv',
                 index_col=None,
                 parse_dates=[
                     "tpep_pickup_datetime", "tpep_dropoff_datetime"
                 ])

# Look at the 'head' of the dataframe, i.e. the top 5 rows
taxi.head()

These columns have long, hard-to-read names.  We'll want to make them shorter and drop the less useful ones.

## **Refresher on Python data structures**

But before we do that, let's review a few key data structures in python and pandas:

In [None]:
# Lists
listex = [1, 2, 3, 4, "python", "makes", "rory", "roar"]
listex[1]
listex[0:5]

# Iterable

In [None]:
# Dictionaries
dictex = {"rory" : "the lion", "columbia": "the university", "founded": 1754}
dictex["founded"]
dictex["rory"]

# Iterable

In [None]:
# Pandas dataframe
type(taxi)

# Think excel
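
To connect these structures, here is a small sketch (with made-up values) of how a dictionary of lists becomes a dataframe, and how a single column or row of that dataframe is a pandas series.

```python
import pandas as pd

# A dictionary whose keys are column names and whose values are lists
columns_as_lists = {
    "passengers": [1, 2, 4],
    "distance": [0.8, 2.5, 11.0],
}
toy = pd.DataFrame(columns_as_lists)

one_column = toy["distance"]  # selecting one column gives a Series
first_row = toy.iloc[0]       # selecting one row by position also gives a Series

print(type(toy).__name__, type(one_column).__name__)
print(toy.shape)  # (number of rows, number of columns)
```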

Now let's clean up our data.  

In [None]:
taxi.columns

We don't need all the columns, and the columns we do need have cumbersome names.  Let's fix that.

In [None]:
columns = {
    # 'VendorID',
    'tpep_pickup_datetime': 'pickup',
    'tpep_dropoff_datetime': 'dropoff',
    'passenger_count': 'passengers',
    'trip_distance': 'distance',
    # 'RatecodeID',
    # 'store_and_fwd_flag',
    # 'PULocationID',
    # 'DOLocationID',
    'payment_type': 'payment_type',
    'fare_amount': 'fare',
    'extra': 'extra',
    'mta_tax': 'tax',
    'tip_amount': 'tip',
    'tolls_amount': 'tolls',
    'improvement_surcharge': 'improvement_surcharge',
    'total_amount': 'total_fare',
    'congestion_surcharge': 'congestion_tax',
    'airport_fee': 'airport_fee',
}

# choose only columns that are keys in dictionary
taxi = taxi[list(columns.keys())]

# rename columns by values of dictionary
taxi = taxi.rename(columns=columns)

taxi.head()

In [None]:
# Pandas series
type(taxi["pickup"])

# Column

In [None]:
# What are the data types of our fields?
taxi.dtypes

In [None]:
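# Descriptions
# Note: the datetime_is_numeric argument was removed in pandas 2.0, where
# datetimes are always summarized numerically; drop it if you get a TypeError.
taxi.describe(include = "all", datetime_is_numeric=True)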

**What are some observations you can make from the table above?**

Let's go column by column:

*Pick-up and drop-off dates and times:*
* The earliest date in our dataset is 12/31/2008
<!-- * The maximum date looks unrealistic (2081?) -->
* There are nearly 1,370,000 rows in our dataset, but only about 939,000 unique pickup values and only 936,000 dropoff values. Why? Why would there be fewer unique dropoff & pickup date values than total values, and fewer dropoffs than pickups?

*Moving on to passengers and distance:*
* What does having zero passengers mean?
* What does having zero distance mean?
* Are these distance values intuitive?

*Fares and payment:*
* Why would there be a negative number for a fare or tip?
* What does the column "payment type" mean here, and why is it numeric?
* There was a $7,600 cab ride? Was that a mistake?

Luckily, we can take a deeper dive into questions!

Let's start with pick-up and drop-off times:





In [None]:
# Start with a simple bar chart of trip counts
taxi["pickup_day"] = taxi["pickup"].dt.day

# First by day of the month
taxi.groupby("pickup_day")["pickup"].count().plot(kind="bar")

In [None]:
# Then with full pickup dates and times
# taxi.groupby("pickup")["pickup"].count().plot(kind = "bar")

You'll notice that the above takes *forever*. It actually causes our kernel to fail.  Why is this?

Because we're working on 1,370,000 rows of data! And this particular visualization is trying to account for every single point, which in this case is every single pickup date and time. This is one of the first hurdles you will likely experience when working with large and interesting datasets. So, what can you do to analyze the data more quickly?

* Subset the data
* Select a random sample
* Use only the information needed
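
Here is a minimal sketch of those three strategies, plus summarizing before plotting. It runs on randomly generated stand-in data rather than the taxi file, so the numbers themselves mean nothing; only the pattern matters.

```python
import numpy as np
import pandas as pd

# Stand-in data: 10,000 random pickups in January 2021 (made-up values)
rng = np.random.default_rng(0)
n = 10_000
minutes = rng.integers(0, 31 * 24 * 60, n)
toy = pd.DataFrame({
    "pickup": pd.Timestamp("2021-01-01") + pd.to_timedelta(minutes, unit="min"),
    "fare": rng.uniform(3, 60, n).round(2),
    "tip": rng.uniform(0, 10, n).round(2),
})

subset = toy[toy["pickup"] < "2021-01-08"]      # subset: first week only
sample = toy.sample(frac=0.01, random_state=0)  # random 1% sample, reproducible
only_needed = toy[["pickup", "fare"]]           # keep only the columns we need

# Summarize first, then plot the summary (31 bars instead of 10,000 points)
per_day = toy.groupby(toy["pickup"].dt.date).size()
print(per_day.head())
```

Setting `random_state` makes the sample repeatable, which matters for the reproducibility principle from earlier.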

But almost all of these require working with the data in a form other than the one it's in already. This brings us to:

## **Data wrangling**
### *(using Pandas)*
***
Topics to be covered:
1. Math and string operations between columns
2. Summarizing
   *Aside:* you can and should define your own functions. We said that code should be reproducible; if others are using your code or looking at it, functions are a good way to keep things in order.
3. Reshaping
4. Merging (will need to find another dataset to merge with)
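
As a sketch of what merging can look like, one natural candidate is a small lookup table that translates the numeric `payment_type` codes into labels. The labels below are my reading of the TLC data dictionary and should be checked against the current version before use; the trips are made-up values.

```python
import pandas as pd

# Toy trips (hypothetical values)
trips = pd.DataFrame({
    "payment_type": [1, 2, 1, 4],
    "fare": [8.0, 12.5, 30.0, 6.5],
})

# Assumption: code-to-label mapping taken from the TLC yellow taxi data
# dictionary; verify against the published dictionary before relying on it
payment_lookup = pd.DataFrame({
    "payment_type": [1, 2, 3, 4, 5, 6],
    "payment_label": ["Credit card", "Cash", "No charge", "Dispute", "Unknown", "Voided trip"],
})

# A left merge keeps every trip and attaches the matching label
labeled = trips.merge(payment_lookup, on="payment_type", how="left")
print(labeled)
```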



In [None]:
# Let's go ahead and first try to look at just the year of the pickup to see
# what the deal with those 2008 dates is:

taxi["pickup_year"] = taxi["pickup"].dt.year
taxi.head()
taxi["pickup_year"].plot(kind = "hist")

Wait a minute, it looks like the minimum year is 2008, but almost all of the trips are from 2020 or later. What's going on here?

In [None]:
taxi[taxi["pickup_year"] < 2020]

Okay...there are just a few of these. However, this is not something we could have figured out from the histogram, and we don't know how many such rows there are. How can we get a sense of how many entries are from before 2020?
* Make a plot (tried above)
* List the rows (tried above)
* Count!
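
There is more than one way to count rows that match a condition. This sketch, on made-up values, shows three; note that `.count()` on a dataframe reports non-missing values per column, so its numbers can differ from the row count when data is missing.

```python
import pandas as pd

# Toy data (hypothetical values), including one missing tip
toy = pd.DataFrame({
    "pickup_year": [2008, 2009, 2021, 2021, 2021],
    "tip": [1.0, None, 2.0, 0.0, 3.5],
})

early = toy["pickup_year"] < 2020  # boolean mask: True where the year is early

rows_by_sum = early.sum()         # True counts as 1, so this is the row count
rows_by_len = len(toy[early])     # length of the filtered dataframe
non_missing = toy[early].count()  # non-missing values in each column

print(rows_by_sum, rows_by_len)
print(non_missing)
```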

In [None]:
# Count rows where year < 2020
taxi[taxi["pickup_year"] < 2020].count()

Huh, there are 4 such rows. How do we proceed?
* We can drop them
* We can investigate them further to see if there's something in common with them
* We can change the year and assume 2020

Let's investigate!

In [None]:
# What are the summary stats of these rows and do they differ from the other data?
taxi["pickup_year"].unique()


taxi[taxi["pickup_year"] < 2020]["pickup_year"].plot(kind = "hist")

In [None]:
# Describe these early-year rows
taxi[taxi["pickup_year"] < 2020].describe(include = "all", datetime_is_numeric=True)

In [None]:
# Compare with rows from 2020 onward
taxi[taxi["pickup_year"] >= 2020].describe(include = "all", datetime_is_numeric=True)

In [None]:
# From these basic summarizations (not testing for significance),
# it looks like these rows aren't too different from the rows from 2020
# onward, so it's probably safe to go ahead and drop them.

taxi = taxi[taxi["pickup_year"] >= 2020]

Now that we've got the general hang of cleaning data, let's go ahead and clean up the anomalies in passenger number, distance, fare, and payment.

In [None]:
# Cleaning for passengers, distance, fare, and payment
taxi.describe()

# Cleaning passengers
taxi.groupby("passengers")["passengers"].count()

# There are a few rides with zero passengers, which seems suspect. Let's drop.
# (We could have also investigated what the fare is for these rides:)
taxi[taxi["passengers"] == 0]["fare"].plot(kind = "hist")
# This shows that many of these do have some nonzero fare - so they're
# definitely suspect, and also irrelevant to the questions we want to ask.
taxi = taxi[taxi["passengers"] > 0]
taxi.describe()

In [None]:
# Cleaning for distance, fare, payment
taxi.describe()

# Remove trips with 0 distance, after checking how many there are
# Here, it might be good to have a dummy variable rather than plotting a
# histogram or table with all values
taxi["distance"].groupby([taxi["distance"] == 0]).count()

# Looks like these represent just 18.2k values, let's drop
# In another world we could try to interpolate these values using fare,
# but these are not relevant to our research question.

In [None]:

# Also remove the one trip that is more than 1000 miles, which seems unrealistic
taxi["distance"].groupby([taxi["distance"] > 1000]).count()


In [None]:
taxi = taxi[(taxi["distance"] > 0) & (taxi["distance"] < 1000)]

taxi["distance"].describe()

In [None]:
# Looks like there is in fact at least one trip that had a distance traveled of
# 0.01. Is this reasonable? Should we have used a higher cutoff than 0?
# Let's calculate in feet!
print(f"0.01 mile = {0.01 * 5280} feet") # number of feet in a mile

# Ok, does it make sense to have a trip that was only about 53 feet long?
# One NYC block (north-south) is 264 feet. How many miles is this?

print(f"One block = {264 / 5280} miles \n")
# 0.05 miles! Let's make this the new cutoff.
taxi = taxi[taxi["distance"] > 0.05]
taxi["distance"].describe()

In [None]:
# Cleaning the fare and tip fields
# Multiple choice - which one of these WILL work?
# A - taxi["fare, tip"].describe()
# B - taxi["fare", "tip"].describe()
# C - taxi[["fare", "tip"]].describe()
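
Once you've picked an answer, this sketch on a made-up two-column dataframe shows how the bracket styles differ: a single label returns a series, a list of labels returns a dataframe, and two labels separated by a comma are treated as one (nonexistent) column name.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"fare": [8.0, 12.5, 30.0], "tip": [1.5, 0.0, 6.0]})

one_column = toy["fare"]            # single label -> Series
two_columns = toy[["fare", "tip"]]  # list of labels -> DataFrame

try:
    toy["fare", "tip"]              # a tuple is looked up as ONE column label
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True      # no column is named ("fare", "tip")

print(type(one_column).__name__, type(two_columns).__name__, tuple_lookup_failed)
```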



In [None]:
# Drop negative fares since these look like refunds
# In a transactions database, why would this not be a good idea?
# Should the threshold be higher? What is the base fare for an NYC taxi?
# The cutoff below keeps only fares above 2.5, which also removes the negatives.

taxi = taxi[taxi["fare"] > 2.5]

In [None]:
taxi["fare"].plot(kind = "hist")

In [None]:
# Dropping far-off fares because they do not seem to be representative
taxi = taxi[taxi["fare"] < 200]
taxi["fare"].plot(kind = "hist")

In [None]:
taxi["tip"].describe()

In [None]:
# Tip seems to be okay - no negative values!

Now for the fun part: let's ask questions!

- Let's investigate the potential effects of increasing the number of people in a taxi. Does it affect how likely someone is to tip, and how much? Does it relate to how far they travel?
- I'm nosy, so I also want to know - how much do people generally tip?
- Are there differences in volume of passengers during different times of day?
- What about payment type - who is still using cash, and at what time of day? Are they in groups?
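
For the time-of-day question, one possible starting point is to extract the hour from each pickup and group on it. This is only a sketch on a handful of made-up trips; `pickup_hour` is a column name introduced here for illustration.

```python
import pandas as pd

# Toy trips (hypothetical values)
toy = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2021-01-04 08:05", "2021-01-04 08:40", "2021-01-04 17:20",
        "2021-01-05 08:15", "2021-01-05 23:50",
    ]),
    "passengers": [1, 2, 1, 3, 1],
})

toy["pickup_hour"] = toy["pickup"].dt.hour  # hour of day, 0 to 23

# For each hour: number of trips ("count") and total passengers ("sum")
by_hour = toy.groupby("pickup_hour")["passengers"].agg(["count", "sum"])
print(by_hour)
```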


In [None]:
# Creating a function and applying it to each row of a pandas dataframe
def pass_sort(row):
    if row['passengers'] > 3:
        return 'Four or more'
    if row['passengers'] > 1:
        return 'Two to three'
    if row['passengers'] == 1:
        return 'One'

# axis=1 passes one row at a time to pass_sort
taxi["passenger_type"] = taxi.apply(pass_sort, axis = 1)

# Use column manipulation to create new columns
taxi["tip_pct"] = taxi['tip'] / taxi['fare']
taxi["pickup_time"] = taxi["pickup"].dt.time
taxi.head()

# Reshaping data
taxi_pivot = taxi.pivot(columns = "passenger_type", values = ["pickup_time", "tip_pct"])
taxi_pivot.head()
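
To see what this kind of pivot does, here is a sketch on four made-up rows. Each row keeps its place in the index and its value lands in the column for its passenger type, with missing values everywhere else. If all we want is a summary per group, a `groupby` gives the same medians without building the wide table first.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "passenger_type": ["One", "One", "Two to three", "Four or more"],
    "tip_pct": [0.20, 0.10, 0.25, 0.15],
})

# Wide form: one column per passenger type, one row per original row
wide = toy.pivot(columns="passenger_type", values="tip_pct")
medians_from_wide = wide.median()  # missing values are skipped by default

# Same summary straight from the long form
medians_direct = toy.groupby("passenger_type")["tip_pct"].median()

print(wide)
print(medians_direct)
```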


In [None]:
# Can also use transform to add a column (like R mutate)
taxi["rec"] = 1
taxi["medtippct"] = taxi.groupby("rec")["tip_pct"].transform('median')
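# groupby(...).transform needs a grouping variable; the constant "rec" column
# puts every row in a single group, so this is the overall median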

# And can also use transform and apply to add a column and make a comparison
# at the same time
taxi["above_medtippct"] = taxi["tip_pct"].transform(lambda x: x > x.median())

taxi[["tip_pct", "medtippct", "above_medtippct"]].head()
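
The difference between aggregating and transforming is easier to see with more than one group. In this sketch on made-up values, the aggregate returns one row per group, while `transform` returns one value per original row, which is what lets us store it as a new column and compare against it. The column names `group_median` and `above_group_median` are introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "passenger_type": ["One", "One", "One", "Two to three", "Two to three"],
    "tip_pct": [0.10, 0.20, 0.30, 0.00, 0.40],
})

# Aggregate: collapses to one value per group
per_group = toy.groupby("passenger_type")["tip_pct"].median()

# Transform: same length as the original data, each row gets its group's median
toy["group_median"] = toy.groupby("passenger_type")["tip_pct"].transform("median")
toy["above_group_median"] = toy["tip_pct"] > toy["group_median"]

print(per_group)
print(toy)
```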



In [None]:
taxi_pivot.head()

In [None]:
# Collapse to find the median and mean tip_pct for each passenger type
taxi_pivot["tip_pct"].aggregate("median")
taxi_pivot["tip_pct"].aggregate("mean")

In [None]:
taxi_pivot["tip_pct"].aggregate("mean").plot(kind = "bar")

In [None]:
# Visualize tip percent and distance by passenger number

plt.scatter(x = taxi["passengers"], y = taxi["distance"])
plt.title('Passengers vs distance')
plt.xlabel('Passenger number')
plt.ylabel('Distance (mi)')
plt.show()

taxi.groupby("passengers")["distance"].median().plot(kind = "bar")
plt.show()

taxi.plot.scatter(x="passengers", y="tip_pct")
plt.title('Passengers vs tip percentage')
plt.xlabel('Passenger number')
plt.ylabel('Tip (fraction of fare)')
plt.show()

taxi.groupby("passengers")["tip_pct"].aggregate("median").plot(kind = "bar")
plt.show()

taxi.groupby("passengers")["tip_pct"].aggregate("mean").plot(kind = "bar")
plt.show()

## **Where to find data**
### *(using CU Libraries, Google datasets, US government agencies, and many more)*
1. Types of data: tabular (survey, transaction, summary, etc.), geospatial, text
2. List of potential sources for each of the above
3. Tools to collect your own data (MTurk or Qualtrics for survey data, scraping for text data)

## **Topics not covered today**
***
A list of all the topics that you can dive deeper into (asterisks are by the ones that are most important):
- GitHub*
- statistics*
- communication / translation*
- visualization*
- NLP
- ML
- Cloud computing + access
- geospatial data
- applications