EXPLORING A DATAFRAME

Use this template to get a solid understanding of the structure of your DataFrame and its values before jumping into a deeper analysis.
This template leverages many of pandas' handy functions for the most fundamental exploratory data analysis steps, including inspecting column data types and distributions, creating exploratory visualizations, and counting unique and missing values.

In [None]:
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Load your dataset into a DataFrame
df = pd.read_csv("data/taxis.csv")

# Print the number of rows and columns
print("Number of rows and columns:", df.shape)

In [None]:
# Print out the first five rows
df.head()

In [None]:

# Understanding columns and values
# The info() function prints a concise summary of the DataFrame. For each column, it shows the name, the data type, and the number of non-null values. This helps you gauge how many values are missing and which data types you're dealing with.

df.info()
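
If you want the column types as lists rather than printed text, select_dtypes() is one option. The sketch below uses a small hypothetical DataFrame (toy) in place of your own data; the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for your own DataFrame
toy = pd.DataFrame(
    {
        "passengers": [1, 2, 1],
        "fare": [7.5, 12.0, 9.0],
        "payment": ["cash", "credit card", "cash"],
    }
)

# Split the column names by data type
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Text columns with few distinct values can be stored as categories
toy["payment"] = toy["payment"].astype("category")

print("Numeric columns:", numeric_cols)
print("Other columns:", other_cols)
print(toy.dtypes)
```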

In [None]:

#To get an exact count of missing values in each column, call the isna() function and aggregate it using the sum() function:

df.isna().sum()

#If there are missing values, you'll have to decide if and how missing values should be dealt with.
#If you want to learn more about removing and replacing values, check out chapter 2 of DataCamp's Data Manipulation with pandas course.
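
As a rough illustration of the two most common choices, dropping rows or filling gaps, here is a minimal sketch. The DataFrame (toy) and its values are hypothetical, and filling with the median or an "unknown" label is only one possible strategy, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame(
    {
        "fare": [7.5, np.nan, 12.0, 9.0],
        "payment": ["cash", "credit card", None, "cash"],
    }
)

# Share of missing values per column (between 0.0 and 1.0)
missing_share = toy.isna().mean()

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill the gaps, here with the median for numbers and a label for text
filled = toy.fillna({"fare": toy["fare"].median(), "payment": "unknown"})

print(missing_share)
print(len(dropped), "rows left after dropna()")
print(filled)
```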

In [None]:

#The describe() function generates descriptive statistics for each numeric column.
#Its output shows the count of non-missing values, the mean, the standard deviation, the minimum and maximum, and the 25th, 50th, and 75th percentiles.
#Note that missing values are excluded from these calculations.

df.describe()
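
describe() can also summarize non-numeric columns and report other percentiles. The following sketch assumes a small hypothetical DataFrame (toy); the percentile choices are arbitrary examples.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame(
    {
        "distance": [1.2, 3.4, 0.8, 5.0],
        "color": ["yellow", "green", "yellow", "yellow"],
    }
)

# Non-numeric columns: count, number of unique values, most frequent value, and its frequency
text_summary = toy.describe(exclude=[np.number])

# Numeric columns with custom percentiles instead of the default quartiles
numeric_summary = toy.describe(percentiles=[0.05, 0.5, 0.95])

print(text_summary)
print(numeric_summary)
```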

In [None]:
#Use the unique() function to print out the unique values of a column:
df["pickup_borough"].unique()  # Replace with a column of interest

In [None]:

#Use the value_counts() function to print out the number of rows for each unique value:

df["pickup_borough"].value_counts(  # Replace with a column of interest
    dropna=True  # Set to False if you want to include NaN values
)
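
Proportions are often easier to read than raw counts. This sketch shows value_counts() with normalize=True alongside nunique(), using a hypothetical Series (boroughs) in place of a real column.

```python
import pandas as pd

# Hypothetical column with one missing value
boroughs = pd.Series(
    ["Manhattan", "Queens", "Manhattan", None, "Manhattan"],
    name="pickup_borough",
)

# Raw counts, including the missing value
counts = boroughs.value_counts(dropna=False)

# Share of each value among the non-missing rows
shares = boroughs.value_counts(normalize=True)

# Number of distinct non-missing values
n_unique = boroughs.nunique()

print(counts)
print(shares)
print("Distinct values:", n_unique)
```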

'''
Basic data visualizations
The pandas plot() function makes it easy to plot columns from your DataFrame. This section goes through a few basic data visualizations to help you better understand your data.
If you need a refresher on visualizing DataFrames, chapter 4 of DataCamp's Data Manipulation with pandas course is a useful reference!
'''

In [None]:
#Boxplots can help you identify outliers:

df.plot(
    kind="box", figsize=(12, 8)  # Specifies a boxplot with set width & height in inches
);
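
By default, a boxplot draws points as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. If you want those rows as data rather than as dots on a chart, the same rule can be sketched by hand. The Series (fares) below is hypothetical.

```python
import pandas as pd

# Hypothetical fares with one unusually large value
fares = pd.Series([6.0, 7.5, 8.0, 9.0, 10.5, 11.0, 95.0], name="fare")

# Quartiles and interquartile range
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still needs a judgment call based on what you know about the data.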

In [None]:

# To further inspect distribution of data in a column, you can use a histogram:

df.plot(
    kind="hist",  # Specifies a histogram
    y="distance",  # Replace with a numeric column of interest
    bins=20,  # Set the number of bins in the histogram
    figsize=(12, 8)  # Set width & height in inches
);
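
To read the histogram as numbers, you can bin the column yourself with pd.cut() and count the rows per bin. This is a sketch on a hypothetical Series (distances); the choice of three bins is arbitrary.

```python
import pandas as pd

# Hypothetical distances with a long right tail
distances = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 9.5], name="distance")

# Split the value range into three equal-width bins
binned = pd.cut(distances, bins=3)

# Count rows per bin, keeping the bins in order instead of sorting by count
bin_counts = binned.value_counts(sort=False)

# Positive skewness indicates a longer tail on the right
skewness = distances.skew()

print(bin_counts)
print("Skewness:", skewness)
```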


In [None]:

#You can use a bar plot to compare averages (and other aggregations!) of a numeric column across a categorical column:

# Group the data by a categorical column and aggregate a numeric column (the result is a Series)
# You can also use sum(), count(), and other aggregations instead of mean()
df_bar = df.groupby(["pickup_borough"])["fare"].mean()

df_bar.plot(
    kind="bar",  # Specifies vertical bar plot
    ylabel="Average fare",  # Add a y-axis label
    xlabel="Starting Borough",  # Add an x-axis label
    figsize=(15, 5)  # Set width & height in inches
);
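
If you want several aggregations side by side before choosing one to plot, agg() accepts a list of function names. The sketch below uses a hypothetical DataFrame (toy) with the same column names as the example above.

```python
import pandas as pd

# Hypothetical toy data standing in for your own DataFrame
toy = pd.DataFrame(
    {
        "pickup_borough": ["Manhattan", "Queens", "Manhattan", "Queens", "Bronx"],
        "fare": [10.0, 30.0, 14.0, 20.0, 8.0],
    }
)

# One row per group, one column per aggregation, sorted by the mean
summary = (
    toy.groupby("pickup_borough")["fare"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Including the count next to the mean helps you spot averages that rest on only a handful of rows.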

In [None]:

#If you have any date columns, you can use a line plot to find patterns, such as seasonality:

# Convert any date and/or time column to datetime format
df["pickup"] = pd.to_datetime(df["pickup"])

# Group by a component of the datetime column and choose an aggregation (the result is a Series)
# Here the component is the hour (.dt.hour); other components, such as day or month, are listed here:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
df_dates = df["pickup"].groupby([df["pickup"].dt.hour]).count()

df_dates.plot(
    kind="line",  # Specifies line plot
    ylabel="Number of trips",  # Add a y-axis label
    xlabel="Hour of the day",  # Add an x-axis label
    figsize=(15, 5)  # Set width & height in inches
);
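
The same idea works for other time components. As a sketch, the example below counts trips per weekday with dt.day_name() and per calendar day with resample(). The DataFrame (toy) and its timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical pickup times stored as text
toy = pd.DataFrame(
    {
        "pickup": [
            "2019-03-01 08:15:00",
            "2019-03-01 17:40:00",
            "2019-03-02 09:05:00",
            "2019-03-04 08:50:00",
        ],
        "passengers": [1, 2, 1, 3],
    }
)
toy["pickup"] = pd.to_datetime(toy["pickup"])

# Group by the weekday name of each pickup
trips_per_weekday = toy.groupby(toy["pickup"].dt.day_name())["pickup"].count()

# Resample to calendar days; days without trips appear with a count of 0
trips_per_day = toy.set_index("pickup").resample("D").size()

print(trips_per_weekday)
print(trips_per_day)
```

Unlike a plain groupby, resample() keeps empty periods in the result, which avoids misleading gaps in a line plot.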


In [None]:

# Scatter plots are useful to investigate the relationship between two numeric variables:

df.plot(
    kind="scatter",  # Specifies a scatter plot
    x="distance",  # Replace with a numeric column for the x-axis
    y="fare",  # Replace with a numeric column for the y-axis
    figsize=(15, 5)  # Set width & height in inches
);


In [None]:

#To further explore relationships between columns, generate a correlation matrix using pandas' corr() function and plot it with Seaborn's heatmap() function.

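# Generate and print pairwise correlation of the numeric columns
# numeric_only=True skips text columns, which would raise an error in pandas 2.0 and later
cm = df.corr(numeric_only=True)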
print(cm)

# Plot the correlation matrix nicely using Seaborn
sn.heatmap(cm, annot=True)
plt.show()
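
With many columns, a heatmap gets hard to scan. One possible complement is a ranked list of the most strongly correlated column pairs. The sketch below assumes a hypothetical DataFrame (toy) and keeps each pair once by masking the matrix to its upper triangle.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; fare is an exact linear function of distance
toy = pd.DataFrame(
    {
        "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
        "fare": [5.0, 8.0, 11.0, 14.0, 17.0],
        "tip": [2.0, 0.0, 3.0, 1.0, 2.5],
        "color": ["yellow", "green", "yellow", "green", "yellow"],
    }
)

# Correlation matrix of the numeric columns only
cm = toy.corr(numeric_only=True)

# Boolean mask that is True strictly above the diagonal,
# so each pair appears once and self-correlations are dropped
mask = np.triu(np.ones(cm.shape, dtype=bool), k=1)

# Flatten to one value per pair and sort by absolute strength
pairs = (
    cm.where(mask)
    .unstack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```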
"""
Eager for more visualizations? Take a look at the documentation of the plot() function to see what other visualizations you can make.
You can also take DataCamp courses to learn more about powerful Python visualization libraries,
such as matplotlib, seaborn, and plotly!
"""