<a href="https://colab.research.google.com/github/cul-data-club/meetings/blob/main/2022/General_Introduction_to_Pandas_and_Exploratory_Data_Analysis_(EDA).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Columbia University Libraries Data Club](https://library.columbia.edu/services/research-data-services/data-club.html) presents:

# General Introduction to Pandas and Exploratory Data Analysis (EDA)

based on [the work of Isha Shah](https://github.com/cul-data-club/intro-to-data/blob/bdb85366e3648ce52983d2f3bb090095860ae327/Intro_to_Data_in_Python_IS_v3.ipynb) for Columbia University Libraries Data Club

[Sign up for the Data Club Mailing list](https://listserv.cuit.columbia.edu/scripts/wa.exe?SUBED1=CUL-DATA-CLUB&A=1)

---

## Objectives:

* Critically examining and responsibly analyzing data that is (always) imperfect
* Recognizing and applying data wrangling fundamentals in Pandas
* Demonstrating knowledge of resources for finding data
* Relating an interest in working with data

## Principles of Data Analysis

No matter the context in which you're using data, there will always be a few principles you must follow. At the end of the day, you are doing science - you are using empirical observations to test hypotheses (and occasionally, to predict the future based on these hypotheses). Therefore, it is important to follow the same principles that guide scientific inquiry. Your analytical work must be:

1. Well-documented
1. Conscientious in reducing bias (the same way we randomize trials and only collect the data we need to answer a specific question, we have to make sure our data are clean, examined for potential bias, and are suited to the question we want to answer)
1. Reproducible (commenting and code sharing are crucial)
1. Responsibly and clearly communicated (you've done all the work, and now it's important to get it out in the world! Communicating the results of data analyses can be very difficult, especially to folks who don't have a background in it. It is your responsibility to do the best you can in stating what your research can and can't answer, and to make sure that any communication about the data comes from the data - don't make unfounded leaps, or allow others to)

[Pandas](https://pandas.pydata.org/) is a Python data analysis tool that can help achieve the above.


In [None]:
# INITIAL ENVIRONMENT (run only once)

# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import os

# Let Colab print every output, not just the output of the final command
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Importing Data with Pandas

Pandas has several `pd.read_*()` functions for importing data in different formats, such as CSV, pickle, JSON, and Excel (`.xlsx`) files. Use control/command-space completion to see all the `read_*()` functions available, and then use it again to see the signature of `pd.read_csv()`.
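
As a minimal sketch of what `read_csv()` does, the example below reads a tiny CSV from an in-memory string instead of a file or URL. The two rows are invented for illustration and are not taken from the taxi dataset.

```python
import io
import pandas as pd

# Made-up two-row sample, for illustration only.
csv_text = (
    "pickup,passengers,fare\n"
    "2018-02-01 08:15:00,1,7.5\n"
    "2018-02-01 09:40:00,2,12.0\n"
)

# read_csv accepts any file-like object, not only paths and URLs.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # pickup is read as plain text unless dates are parsed
```

The same call works with a local path or a URL in place of the `StringIO` object.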

In [None]:
# Use control/command-space completion to see all of the Pandas .read methods 
# Then do it again to see the signature of Pandas's .read_csv() method:

# pd.read

We can load data straight from the internet. In this case, we will use a subset of the [New York City Taxi and Limousine Commission Trip Record](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) dataset, limited to February 2018.

In [None]:
# Import file from GitHub
url = "https://raw.githubusercontent.com/columbia-university-libraries-data-club/intro-to-data/master/taxi-data.csv"
# "df" is the conventional name for the Pandas DataFrame returned by the Pandas read functions.
df = pd.read_csv(url)
df.head()

Chirag Goyal provides [20 crucial Pandas methods for EDA](https://www.analyticsvidhya.com/blog/2021/04/20-must-known-pandas-function-for-exploratory-data-analysis-eda/), and we see the first in use here. Use control/command-space completion to read the documentation about `.head()`. A few other methods and properties let us understand the basic shape (including `.shape`!) of our data.

In [None]:
df.tail()
print('\n')
df.info()
print(f'\ndf.shape: {df.shape}')
print(f'df.size: {df.size} (also: rows * columns = {df.shape[0] * df.shape[1]})')
print(f'df.ndim: {df.ndim}')

What can we already say about this dataset? Note particularly the `pickup` and `dropoff` columns.

On the bright side, the "Non-Null Count" column in the `.info()` output shows that we have no null values, so we can skip EDA methods such as `.isna()` and `.dropna()`. We should still check for duplicate rows, however.
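
For reference, here is a rough sketch of how these checks behave on a frame that does have problems. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing fare and one repeated row.
toy = pd.DataFrame({
    "fare": [7.5, np.nan, 12.0, 12.0],
    "tip": [1.0, 2.0, 0.0, 0.0],
})

missing_per_column = toy.isna().sum()     # count of NaN values in each column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row
cleaned = toy.dropna().drop_duplicates()  # drop both kinds of problem row

print(missing_per_column)
print(duplicate_rows)
print(cleaned)
```

Whether dropping is the right response depends on the question being asked; the sketch only shows the mechanics.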

In [None]:
df.duplicated().sum()

Goyal's other methods and properties include `.sample()`, `.nunique()`, `.index`, `.columns`, `.nlargest()`, `.corr()`, `.dtypes`, `.memory_usage()`, and `.value_counts()`. This last one we'll use below.
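
A quick sketch of a few of these on a small, invented frame (the values are not from the taxi data):

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "passengers": [1, 1, 2, 4, 1],
    "fare": [7.5, 9.0, 12.0, 30.0, 5.5],
})

print(toy.dtypes)                        # a property, so no parentheses
print(toy.nunique())                     # distinct values per column
print(toy.nlargest(2, "fare"))           # the two rows with the highest fare
print(toy["passengers"].value_counts())  # how often each value occurs
print(toy.corr())                        # pairwise correlations between columns
```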

In [None]:
#In the meantime...

In [None]:
# Import again, but this time parsing dates
df_unparsed_dates = df
df = pd.read_csv(url, parse_dates=['pickup', 'dropoff'])
df.info()

In [None]:
# Now compare descriptions of both datasets
# (pandas 2.0 removed the datetime_is_numeric argument and always summarizes
# datetimes numerically; on pandas 1.x, add datetime_is_numeric=True)
df_unparsed_dates.describe(include="all")
print('\n')
df.describe(include="all")

How is our story about the NYC taxi data changing? What on earth do we make of some of these values?

Let's go column by column:

1. Pick-up and drop-off dates and times:
  - The earliest date in our dataset is 12/31/2008
  - The maximum date looks unrealistic (2081?)
  - There are nearly 770,000 rows in our dataset, but only about 630,000 unique pickup and dropoff values. Why? Why would there be fewer unique dropoff than pickup date values?
1. Passengers and distance:
  - What does having zero passengers mean?
  - What does having zero distance mean?
  - Are these distance values intuitive?
1. Fares and payment:
  - Why would there be a negative number for a fare or tip?
  - What does the column "payment type" mean here, and why is it numeric?
  - There was a $2,600 cab ride? Was that a mistake?

Let's start isolating our dataset to see what is happening.
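
The `.dt` accessor is only available on datetime columns, which is why re-importing with `parse_dates` matters. A minimal sketch on a handful of invented timestamps:

```python
import pandas as pd

# Made-up timestamps, including one stray month and one stray year.
stamps = pd.to_datetime(pd.Series([
    "2018-02-01 08:15:00",
    "2018-02-14 19:30:00",
    "2018-03-01 00:05:00",
    "2008-12-31 23:59:00",
]))

months = stamps.dt.month  # integer month of each timestamp
years = stamps.dt.year    # integer year of each timestamp

# Count rows per month, ordered by month number.
month_counts = months.value_counts().sort_index()
print(month_counts)
```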

In [None]:
# Create a new column of just months
df['pickup_month'] = df['pickup'].dt.month

# Plot a bar chart by month
df.groupby('pickup_month')['pickup'].count().plot(kind='bar')

In [None]:
# Can we get this information more digestibly?
df.groupby('pickup_month')['pickup'].count().plot(kind='bar', logy=True)

In [None]:
# Or, as a table:
df['pickup_month'].value_counts()

In [None]:
# Let's repeat the above, but with years and not months:
df['pickup_year'] = df['pickup'].dt.year
df['pickup_year'].value_counts()
df.groupby('pickup_year')['pickup'].count().plot(kind='bar', logy=True)

In short, the dataset billed as NYC taxi trips for February 2018 includes trips from other months and from other years. Time to wrangle.

## Wrangling Data with Pandas to Make Analysis _Possible_

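Pandas has a peculiar (R-derived) syntax for subsetting data. Selecting a column returns a Series (the building block of a DataFrame), and comparing that Series with an operator yields a Series of True/False values. Passing that boolean Series back to the DataFrame in square brackets keeps only the matching rows.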
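
A minimal sketch of that pattern, using an invented four-row frame:

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "pickup_year": [2018, 2018, 2008, 2081],
    "pickup_month": [2, 3, 12, 2],
    "fare": [7.5, 9.0, 12.0, 30.0],
})

is_2018 = toy["pickup_year"] == 2018  # a Series of True/False values
is_feb = toy["pickup_month"] == 2

# Combine masks with & (and), | (or), or ~ (not), then index the frame.
feb_2018 = toy[is_2018 & is_feb]
print(feb_2018)
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because `&` binds more tightly than `==`.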

In [None]:
# See just the rides from 2081
# (on pandas 1.x, add datetime_is_numeric=True to each describe call;
# pandas 2.0 removed that argument)
df[df['pickup_year'] == 2081].describe(include="all")

# Or from before 2018
df[df['pickup_year'] < 2018].describe(include="all")

In [None]:
# Let's just limit the dataset to what we want and not worry about the broken entries

# Combine both conditions in a single mask with & (each one in parentheses)
df = df[(df['pickup_year'] == 2018) & (df['pickup_month'] == 2)]
print(df['pickup_year'].unique())
print(df['pickup_month'].unique())

In [None]:
# Cleaning for passengers, distance, fare, and payment
df.describe()

In [None]:
# Cleaning passengers
df.groupby("passengers")["passengers"].count()

# There are a few rides with zero passengers, which seems suspect. Let's drop.
# (We could have also investigated what the fare is for these rides:)
df[df["passengers"] == 0]["fare"].plot(kind = "hist")
# This shows that many of these do have some nonzero fare - so they're 
# definitely suspect, and also irrelevant to the questions we want to ask.
df = df[df["passengers"] > 0]
df.describe()

In [None]:
# Cleaning for distance, fare, payment
df.describe()

# Remove trips with 0 distance, after checking how many there are
# Here, it might be good to have a dummy variable rather than plotting a
# histogram or table with all values
df["distance"].groupby([df["distance"] == 0]).count()

# Looks like these represent just 7.6k values, let's drop
# In another world we could try to interpolate these values using fare,
# but these are not relevant to our research question.

In [None]:
df = df[df["distance"] > 0]
df["distance"].describe()

In [None]:
# Looks like there is in fact at least one trip that had a distance traveled of 
# 0.01. Is this reasonable? Should we have used a higher cutoff than 0?
# Let's calculate in feet!
0.01 * 5280 # number of feet in a mile
# Ok, does it make sense to have a trip that was only 52 feet long?
# One NYC block (north-south) is 264 feet. How many miles is this?
264 / 5280
# 0.05 miles! Let's make this the new cutoff.
df = df[df["distance"] > 0.05]
df["distance"].describe()

In [None]:
# Cleaning the fare and tip fields
# Multiple choice - which one of these WILL work?
# A - df["fare, tip"].describe()
# B - df["fare", "tip"].describe()
# C - df[["fare", "tip"]].describe()


In [None]:
df[["fare", "tip"]].describe()

In [None]:
# Drop negative fares since these look like refunds
# In a transactions database, why would this not be a good idea?
# Should the threshold be higher? What is the base fare for an NYC taxi?
# The cutoff below keeps only fares above 2.5, which also removes the negative ones.

df = df[df["fare"] > 2.5]
df["fare"].plot(kind = "hist", logy=True)

In [None]:
# Dropping far-off fares because they do not seem to be representative
df = df[df["fare"] < 200]
df["fare"].plot(kind = "hist", logy=True)

Now for the fun part: let's ask questions!

Let's investigate the potential effects of increasing the number of people in a taxi. Does it affect how likely someone is to tip, and how much? Does it relate to how far they travel?
I'm nosy, so I also want to know - how much do people generally tip?
Are there differences in volume of passengers during different times of day?
What about payment type - who is still using cash, and at what time of day? Are they groups?
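
One way to approach the first question is to bucket rides by passenger count and compare a summary statistic across the buckets. The sketch below uses `pd.cut` for the bucketing, which is a swapped-in alternative to the row-wise function used in the next cell, and it runs on invented values rather than the taxi data.

```python
import numpy as np
import pandas as pd

# Made-up rides, for illustration only.
toy = pd.DataFrame({
    "passengers": [1, 2, 3, 5, 1],
    "fare": [10.0, 20.0, 10.0, 40.0, 8.0],
    "tip": [2.0, 2.0, 0.0, 10.0, 2.0],
})

# Bins are right-inclusive: (0, 1], (1, 3], (3, inf).
toy["passenger_type"] = pd.cut(
    toy["passengers"],
    bins=[0, 1, 3, np.inf],
    labels=["One", "Two to three", "Four or more"],
)
toy["tip_pct"] = toy["tip"] / toy["fare"]

# Median tip share within each passenger bucket.
median_tip = toy.groupby("passenger_type", observed=True)["tip_pct"].median()
print(median_tip)
```

On a frame this small the medians mean nothing; the point is only the shape of the computation.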

In [None]:
# Creating a function and applying it to every row of the DataFrame
def pass_sort(row):
    if row['passengers'] > 3:
        return 'Four or more'
    if row['passengers'] > 1:
        return 'Two to three'
    if row['passengers'] == 1:
        return 'One'

# axis=1 passes each row to pass_sort as a Series
df["passenger_type"] = df.apply(pass_sort, axis=1)

# Use column manipulation to create new columns
df["tip_pct"] = df['tip'] / df['fare']
df["pickup_time"] = df["pickup"].dt.time
df.head()

# Reshaping data
df_pivot = df.pivot(columns = "passenger_type", values = ["pickup_time", "tip_pct"])

In [None]:
# Can also use transform to add a column (like R mutate)
df["rec"] = 1
df["medtippct"] = df.groupby("rec")["tip_pct"].transform('median')
# groupby().transform() needs a grouping key, so we group on the constant
# column "rec" to repeat the overall median on every row

# Series.transform() needs no grouping; here it adds a column and makes a
# comparison at the same time
df["above_medtippct"] = df["tip_pct"].transform(lambda x: x > x.median())

df[["tip_pct", "medtippct", "above_medtippct"]].head()

In [None]:
# Collapse to find the median and mean tip_pct for each passenger type
df_pivot["tip_pct"].aggregate("median")
df_pivot["tip_pct"].aggregate("mean")

In [None]:
df_pivot["tip_pct"].aggregate("mean").plot(kind = "bar")


In [None]:
# Visualize tip percent and distance by passenger number

plt.scatter(x = df["passengers"], y = df["distance"])
plt.title('Passengers vs distance')
plt.xlabel('Passenger number')
plt.ylabel('Distance (mi)')
plt.show()

df.groupby("passengers")["distance"].median().plot(kind = "bar")
plt.show()

plt.scatter(x = df["passengers"], y = df["tip_pct"])
plt.show()

df.groupby("passengers")["tip_pct"].aggregate("median").plot(kind = "bar")
plt.show()

df.groupby("passengers")["tip_pct"].aggregate("mean").plot(kind = "bar")
plt.show()