# Basic Visualization in Pandas

As you're doing exploratory analysis you'll often want to use some simple charts to visualize the data. Things we'll cover:
- Line charts
- Bar charts and Histograms
- Scatterplots
- Styling
- Labeling / legends
- Axes

We won't cover these here, but there are many other visualization libraries that can be used in Jupyter notebooks, including:
- [Bokeh](http://bokeh.pydata.org/en/latest/)
- [Seaborn](https://github.com/mwaskom/seaborn)
- [Plotly](https://plot.ly/python/)

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# This magic function displays charts directly within the notebook.
%matplotlib inline

# This command makes the plots more attractive by adopting the style of a different library called ggplot
matplotlib.style.use("ggplot")

We can plot a list of numbers as a line chart very quickly with the `plt.plot` [function](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot). 

In [None]:
y = [1,2.5,6,20]
plt.plot(y)

And we can specify the x values in a separate list. 

In [None]:
x = [0,2,3,5]
plt.plot(x,y)

There are many styling options in the [documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot). Color is one of the more useful ones. You can use common abbreviations such as "b" (blue), or specify a color directly by its RGB hex representation, e.g. "#aa33aa".

In [None]:
plt.plot(x,y, color="b")

In [None]:
plt.plot(x,y, color="#aa33aa")

You can also change the line style, and add a marker for each point.

In [None]:
plt.plot(x,y, color="b", linestyle="--", marker="o")
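
The same three styles can also be written as a single Matplotlib format string. The sketch below reuses the `x` and `y` values from above; the names `lines` and `line` are only for illustration.

```python
import matplotlib.pyplot as plt

x = [0, 2, 3, 5]
y = [1, 2.5, 6, 20]

# "bo--" combines a blue color ("b"), circle markers ("o") and a dashed line ("--")
lines = plt.plot(x, y, "bo--")

# plt.plot returns a list of Line2D objects, one per plotted series
line = lines[0]
```

The keyword form (`color=`, `linestyle=`, `marker=`) is longer but easier to read, and it is the only option for colors given as hex strings.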

And you can add multiple series to the same plot, including a legend to keep track of what's what. 

In [None]:
plt.plot(x,y, color="b", linestyle="--", marker="o", label="Series 1")
y2 = [10, 5, 10, 15]
plt.plot(x,y2, color="r", linestyle="--", marker="o", label = "Series 2")
plt.legend(loc='best')

We can also customize the axes. 

In [None]:
plt.plot(x,y, color="b", linestyle="--", marker="o", label="Series 1")
y2 = [10, 5, 10, 15]
plt.plot(x,y2, color="r", linestyle="--", marker="o", label = "Series 2")
plt.legend(loc='best')

axes = plt.gca()
axes.set_xticks([0,2,4,6])
axes.set_xticklabels(['zero', 'two', 'four', 'six'], rotation = 30)
axes.set_yticks([0,5,10,15,20,25])
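
Axis labels, a title, and axis limits can be set on the same axes object. This is a minimal sketch; the label and title text are made up for illustration.

```python
import matplotlib.pyplot as plt

x = [0, 2, 3, 5]
y = [1, 2.5, 6, 20]

fig, axes = plt.subplots()
axes.plot(x, y, color="b", linestyle="--", marker="o", label="Series 1")

# Hypothetical label text, chosen only for this example
axes.set_title("Example line chart")
axes.set_xlabel("x value")
axes.set_ylabel("y value")

# Fix the visible range of each axis
axes.set_xlim(0, 6)
axes.set_ylim(0, 25)

axes.legend(loc="best")
```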

Finally, we can set the size of a figure and save it to a file.

In [None]:
plt.plot(x,y, color="b", linestyle="--", marker="o", label="Series 1")
plt.plot(x,y2, color="r", linestyle="--", marker="o", label = "Series 2")
plt.legend(loc='best')

axes = plt.gca()
axes.set_xticks([0,2,4,6])
axes.set_xticklabels(['zero', 'two', 'four', 'six'], rotation = 30)
axes.set_yticks([0,5,10,15,20,25])

fig = plt.gcf()
fig.set_size_inches(10,7)
plt.savefig("chart.png")
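
As a variation, the figure size can be set when the figure is created, and `savefig` accepts options such as `dpi` and `bbox_inches`. The sketch below writes to the system temp directory; the file name `chart_example.png` and the variable `out_path` are assumptions made for this example.

```python
import os
import tempfile

import matplotlib.pyplot as plt

x = [0, 2, 3, 5]
y = [1, 2.5, 6, 20]

# Set the size (in inches) up front instead of calling set_size_inches later
fig, axes = plt.subplots(figsize=(10, 7))
axes.plot(x, y, color="b", linestyle="--", marker="o", label="Series 1")
axes.legend(loc="best")

# Hypothetical output location used only for this example
out_path = os.path.join(tempfile.gettempdir(), "chart_example.png")

# dpi controls the resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```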

### Charting Examples
Pandas has plotting capabilities directly accessible (though under the hood it's still Matplotlib). Lots more examples can be found [here](http://pandas.pydata.org/pandas-docs/stable/visualization.html). Let's examine how to do some basic analysis and charting of the robocall dataset.

In [None]:
robocall_df = pd.read_csv("Data/Telemarketing_RoboCall_Weekly_Data_Transformed.csv")
robocall_df

In [None]:
bar = robocall_df["issues"].value_counts()
print(bar)
bar.plot(kind="barh")

In [None]:
bar = robocall_df["type_telemarketing"].value_counts()
bar.plot(kind="bar", rot=30, color="r")

We'd like to know at what times of day these complaints are reported. A histogram would be helpful, so let's create one.

In [None]:
# Create a new column that's just the hour of the day
robocall_df["hour_of_day"] = pd.DatetimeIndex(robocall_df["time_issued"]).hour
# Drop any rows in which the hour of the day is NaN
robocall_df = robocall_df.dropna(subset=["hour_of_day"])
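
To see what these two steps do, here is a small sketch on a made-up DataFrame. The timestamp strings are assumptions and may not match the format of the real `time_issued` column. It uses `pd.to_datetime` with `errors="coerce"`, an alternative to `pd.DatetimeIndex` that turns unparseable values into missing values instead of raising an error.

```python
import pandas as pd

# Made-up data: two valid timestamps and one value that cannot be parsed
example_df = pd.DataFrame({
    "time_issued": ["2015-10-03 09:15:00", "2015-10-03 14:30:00", "not a time"]
})

# Unparseable values become NaT (missing) rather than raising an error
parsed = pd.to_datetime(example_df["time_issued"], errors="coerce")

# The hour is NaN wherever the timestamp is missing
example_df["hour_of_day"] = parsed.dt.hour

# Keep only the rows that have a valid hour
example_df = example_df.dropna(subset=["hour_of_day"])
```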

Pandas has some charting functions built-in, like [`DataFrame.hist()`](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.hist.html) which can be used to quickly create a matrix of histograms. 

In [None]:
robocall_df.hist(column="hour_of_day", by="issues", bins=24, figsize=(12,4), sharey=True)

But if you want to stack histograms you have to approach things a bit differently. In the example below we first filter for the two series we want to plot, and then pass them both explicitly to the `plt.hist()` function (which is slightly different from the `hist()` function that is available directly on a DataFrame).

In [None]:
robocall_df_robo = robocall_df[robocall_df["issues"]=="Robocalls"]["hour_of_day"]
robocall_df_telemarketing = robocall_df[robocall_df["issues"]=="Telemarketing (including do not call and spoofing)"]["hour_of_day"]
plt.hist([robocall_df_robo, robocall_df_telemarketing], label=["Robocalls", "Telemarketing"], bins=24, stacked=True, color=["#333333", "#888888"])
plt.legend(loc='best')

axes = plt.gca()
axes.set_xlabel("Hour of Day")
axes.set_xticks(np.arange(24))
axes.set_xticklabels(np.arange(24), rotation = 0)

fig = plt.gcf()
fig.set_size_inches(12,4)
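
With `bins=24`, Matplotlib splits the observed range into 24 equal bins, so the bars do not necessarily line up with the integer tick marks. One possible refinement is to pass explicit bin edges, so that each bar is centered on its hour. The sketch below uses made-up hours, not the robocall data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up hours of the day, for illustration only
hours = [0, 9, 9, 10, 14, 14, 14, 23]

# 25 edges at -0.5, 0.5, ..., 23.5 give 24 bins, each centered on an integer hour
bin_edges = np.arange(25) - 0.5

# plt.hist returns the per-bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(hours, bins=bin_edges, color="#888888")

axes = plt.gca()
axes.set_xlabel("Hour of Day")
axes.set_xticks(np.arange(24))
```

The same `bins=bin_edges` argument can be passed to a stacked `plt.hist()` call.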

If we want to show a smooth version of the histogram we could use an area chart. 

In [None]:
robocall_df["hour_of_day"].value_counts().sort_index().plot(kind="area")
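
Note that `value_counts()` only returns hours that actually appear in the data. If some hours have no complaints, one option is to reindex the counts so that every hour from 0 to 23 is present, with missing hours filled in as zero. The sketch below uses a made-up Series of hours.

```python
import pandas as pd

# Made-up hours of the day, for illustration only
hours = pd.Series([9, 9, 10, 14, 14, 14])

# Count per hour, sort by hour, then add the missing hours with a count of 0
counts = hours.value_counts().sort_index().reindex(range(24), fill_value=0)

counts.plot(kind="area")
```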

The robocalls dataset isn't well suited for creating a scatterplot because it's mostly categorical data, with some time variables. Instead let's load in a dataset of [Airbnb listings](http://insideairbnb.com/get-the-data.html) from the Washington, DC area. Here's the [csv file](http://data.insideairbnb.com/united-states/dc/washington-dc/2015-10-03/visualisations/listings.csv).

In [None]:
airbnb_df = pd.read_csv("Data/airbnb_listings_dc.csv")
airbnb_df.sort_values("price", ascending=False )

A [scatterplot](http://pandas.pydata.org/pandas-docs/stable/visualization.html#scatter-plot) is a good way to check if there's a relationship between two variables. For example, we would expect there to be a connection between the number of reviews a listing has received and the number of reviews per month it receives. 

In [None]:
airbnb_df.plot(kind="scatter", x="number_of_reviews", y="reviews_per_month")
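
A scatterplot shows the relationship visually; a correlation coefficient can put a number on it. The sketch below computes the Pearson correlation with `Series.corr()` on a small made-up DataFrame that borrows the two column names from the listings data. The values are invented and say nothing about the real dataset.

```python
import pandas as pd

# Made-up values, for illustration only
example_df = pd.DataFrame({
    "number_of_reviews": [1, 5, 10, 20, 40],
    "reviews_per_month": [0.1, 0.4, 1.1, 1.8, 4.2],
})

# Pearson correlation: a value between -1 and 1; values near 1 indicate a strong positive linear relationship
r = example_df["number_of_reviews"].corr(example_df["reviews_per_month"])

# alpha makes overlapping points easier to see
example_df.plot(kind="scatter", x="number_of_reviews", y="reviews_per_month", alpha=0.5)
```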

Of course there are many other types of charts that you can create with Pandas and/or Matplotlib, but you can accomplish a lot with line, bar, histogram, and scatterplots. For more ideas see the [Pandas documentation on visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html), or look into the libraries linked at the top of this notebook. 