# Visualization

This third notebook explores data visualization in Python using matplotlib.

Docs: https://matplotlib.org/contents.html

Tutorial: https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py

Some samples: https://matplotlib.org/tutorials/introductory/sample_plots.html

In [None]:
# use this IPython magic command to display plots inline in the Jupyter notebook
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

# explicitly register pandas' date converters with matplotlib;
# this avoids a warning about future versions of pandas when plotting timestamps
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

In [None]:
posts_per_user = pd.read_csv('posts_per_user.csv')

In [None]:
# examine the first 5 rows of the dataframe
posts_per_user.head()

In [None]:
# create a bar chart of the number of posts each user has made
plt.figure(figsize=(16,2))
plt.bar(posts_per_user['user_name'], posts_per_user['count'])
plt.title("Posts by user")
plt.xlabel("User name")
plt.xticks(rotation=90)
plt.ylabel("Total number of posts")
plt.show()

## More data manipulation

Here we are going to use the `created_at` timestamp to order the discussion posts and build a timeline showing how the number of discussion posts grows over time.
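Before working with the real file, it may help to see the idea on a tiny example. The sketch below uses a made-up three-row dataframe (`toy_posts` and its timestamps are hypothetical, not taken from `all_posts.csv`) and follows the same steps as the cells that come next: parse the timestamps, sort by them, and take a running total of a dummy counter.

```python
import pandas as pd

# Hypothetical stand-in for all_posts.csv: three posts, deliberately out of order
toy_posts = pd.DataFrame({
    "created_at": ["2019-03-02 10:00", "2019-03-01 09:00", "2019-03-01 17:30"],
})

# Parse the strings into timestamps so that sorting is chronological
toy_posts["created_at"] = pd.to_datetime(toy_posts["created_at"])

# Every row is one post, so a column of ones can be summed to count posts
toy_posts["count"] = 1

# Sort by time, then take the running total of the dummy counter
toy_posts = toy_posts.sort_values("created_at")
toy_posts["cumulative_count"] = toy_posts["count"].cumsum()

print(toy_posts)
```

After sorting, the earliest post gets a cumulative count of 1 and the latest gets 3, which is the shape of data the timeline plot needs.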

In [None]:
all_posts = pd.read_csv('all_posts.csv')
all_posts.dtypes

In [None]:
# above, the datatype for created_at is 'object', meaning that pandas is treating the column as generic data (here, plain strings)
# we can cast the creation dates to timestamps with `to_datetime`
all_posts['created_at'] = pd.to_datetime(all_posts['created_at'])
all_posts['count'] = 1 # this is a dummy counter

In [None]:
# sort by creation date, and make a cumulative count of the number of posts, using the dummy counter
all_posts = all_posts.sort_values('created_at')
all_posts['cumulative_count'] = all_posts['count'].cumsum()

In [None]:
all_posts.head()

In [None]:
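# line chart of the cumulative number of posts over time
plt.figure(figsize=(16,2))
plt.plot(all_posts['created_at'], all_posts['cumulative_count'])
plt.xlabel("Creation date")
plt.ylabel("Cumulative number of posts")
plt.show()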

In [None]:
# the same data as a scatter plot: each point is one post
plt.figure(figsize=(16,2))
plt.scatter(all_posts['created_at'], all_posts['cumulative_count'])
plt.show()

## What next?

Here are some ideas for ways to start exploring plotting:

* Plot a different aggregation of the data. For example, plot the number of posts in each discussion topic as a bar chart.
* Add additional chart features, such as a legend.
* Display two (or more) different series of data in the same chart. For example, can you plot the cumulative posts-per-day for each discussion topic?
* Highlight your own activity in the discussions (e.g. can you make a stacked bar chart of "my posts" and "posts made by other users"?).
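
As a starting point for the second and third ideas, here is a minimal sketch that draws one cumulative line per topic and adds a legend. It runs on made-up data: the `topic_posts` dataframe and the `topic` column name are assumptions for illustration, so check `all_posts.dtypes` for the actual column names before adapting it.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: the real all_posts.csv may name its topic column differently
topic_posts = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-03-01", "2019-03-01", "2019-03-02",
        "2019-03-03", "2019-03-03", "2019-03-04",
    ]),
    "topic": ["intro", "help", "intro", "help", "help", "intro"],
})

# Count posts per day and topic: one row per day, one column per topic
day = topic_posts["created_at"].dt.normalize()
daily = topic_posts.groupby([day, "topic"]).size().unstack(fill_value=0)

# Running total down each column gives the cumulative posts per topic
cumulative = daily.cumsum()

plt.figure(figsize=(16, 2))
for topic in cumulative.columns:
    # the label is what the legend displays for this line
    plt.plot(cumulative.index, cumulative[topic], label=topic)
plt.title("Cumulative posts per topic")
plt.xlabel("Date")
plt.ylabel("Cumulative number of posts")
plt.legend()
plt.show()
```

Using `unstack(fill_value=0)` fills in days on which a topic had no posts, so every line covers the same dates and the running totals stay flat on quiet days rather than leaving gaps.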