# Lab: Visualising Data (1)

It is far easier to look at trends in data by creating plots. Below we do just that and briefly look at plotting data by date.

In [None]:
import pandas as pd
df = pd.read_csv('data/raw/office_ratings.csv', encoding='UTF-8')

In [None]:
df.head()

## Plots

## Univariate - a single variable

Plots are a great way to see trends.

In [None]:
?df.plot

In [None]:
df.plot()

We can look at points instead of lines.

In [None]:
df['total_votes'].plot(title='Total Votes', style='.')

In [None]:
?df.plot

In [None]:
df['imdb_rating'].plot()

Or we could create subplots.

In [None]:
df.plot(subplots=True)

Season and episode are not at all informative here.

In [None]:
df[['imdb_rating', 'total_votes']].plot(subplots=True)

In [None]:
?df.plot

In [None]:
df[['imdb_rating', 'total_votes']].plot(subplots=True, kind='hist')

Unfortunately, our x axis is bunched up. Both columns are drawn on the same x scale, so the plot suggests that all our IMDB ratings lie between 0 and a little less than 1000, which is not useful.

Probably best to plot them individually.

In [None]:
df[['imdb_rating']].plot(kind='hist')

Quite a sensible Gaussian shape (a central peak with the frequency decreasing roughly symmetrically on either side).

In [None]:
df[['total_votes']].plot(kind='hist')

A positively skewed distribution - many smaller values and very few high values.
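As an alternative to one call per column, `DataFrame.hist` draws a separate histogram for each column, each with its own bins and axis range. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real ratings file, so treat the numbers as placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd

# Made-up values standing in for the two real columns.
toy = pd.DataFrame({
    "imdb_rating": [7.6, 8.2, 8.4, 8.8, 9.7, 7.9, 8.1, 8.3],
    "total_votes": [3706, 3405, 2983, 2886, 7934, 1985, 2274, 2041],
})

# One histogram per column; bin edges are computed column by column.
axes = toy.hist(bins=5)
rating_ax, votes_ax = axes.ravel()[:2]
```

On the real data the equivalent call would be `df[['imdb_rating', 'total_votes']].hist()`.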

## Bivariate

The number of votes and the IMDB rating are not independent. These two data variables are related.

Scatter plots are a simple way to explore the relationship between two data variables. Note, I use the term data variables instead of just variables to avoid any confusion.

In [None]:
df.plot(x='imdb_rating', y='total_votes', kind='scatter', title='IMDB ratings and total number of votes')

That is really interesting. The episodes with the highest rating also have the greatest number of votes. There was clearly a great outpouring of happiness there.

Which episodes were they?

In [None]:
df[df['total_votes'] > 5000]
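The threshold of 5000 was read off the plot by eye. If you would rather not hard-code a cut-off, `DataFrame.nlargest` returns the rows with the highest values in a column. This is a minimal sketch on made-up rows (`toy` and its values are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Made-up rows using the same column names as the lab's dataset.
toy = pd.DataFrame({
    "season": [1, 2, 3, 5, 7, 9],
    "episode": [1, 4, 2, 13, 21, 23],
    "imdb_rating": [7.6, 8.4, 8.2, 9.4, 9.7, 9.7],
    "total_votes": [3706, 2886, 2300, 4500, 5749, 7934],
})

# The two rows with the most votes, sorted from highest to lowest.
top = toy.nlargest(2, "total_votes")
print(top)
```

On the real data this would be something like `df.nlargest(2, 'total_votes')`.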

Excellent. Any influence of season on ratings?

In [None]:
df.plot(x='season', y='imdb_rating', kind='scatter', title='IMDB ratings and season')

Season 8 seems a bit low, but nothing too extreme.
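A visual impression like this can be checked with a per-season summary. The sketch below groups a made-up frame by season and computes the mean, minimum and maximum rating; the values in `toy` are invented, so only the pattern of the call carries over to the real data.

```python
import pandas as pd

# Made-up ratings for three seasons.
toy = pd.DataFrame({
    "season": [7, 7, 8, 8, 9, 9],
    "imdb_rating": [8.4, 9.0, 7.5, 7.7, 8.0, 9.0],
})

# One row per season, one column per summary statistic.
season_summary = toy.groupby("season")["imdb_rating"].agg(["mean", "min", "max"])
print(season_summary)
```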

## Dates

Our data contains air date information. Currently, that column has dtype 'object', which means the dates are stored as strings.

In [None]:
df.head()

In [None]:
df.dtypes

We can set this to be datetime instead. That will help us plot the time series of the data.

In [None]:
df['air_date'] = pd.to_datetime(df['air_date'])
df.dtypes
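The conversion can also happen while the file is read, using the `parse_dates` argument of `pd.read_csv`. The sketch below reads a two-row CSV held in a string; the rows and the ISO date format are assumptions made for illustration, and the real file may store its dates differently.

```python
import io
import pandas as pd

# Hypothetical two-row extract; the real file's date format may differ.
csv_text = """season,episode,air_date,total_votes
1,1,2005-03-24,3706
1,2,2005-03-29,3405
"""

# parse_dates converts the named column to datetime during loading.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["air_date"])
print(toy.dtypes)
```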

In [None]:
df.plot(x='air_date', y='total_votes', kind='scatter')

We can look at multiple variables using subplots.

In [None]:
df[['air_date', 'total_votes', 'imdb_rating']].plot(x='air_date', subplots=True)
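Once the column is a datetime, its parts are available through the `.dt` accessor, which makes it easy to aggregate by date before plotting. The sketch below averages votes per air year on a made-up frame; `toy`, its values, and the choice of a yearly mean are all illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd

# Made-up air dates and vote counts.
toy = pd.DataFrame({
    "air_date": pd.to_datetime(
        ["2005-03-24", "2005-09-20", "2006-01-05", "2006-09-21"]
    ),
    "total_votes": [3706, 3400, 2900, 3100],
})

# Group by the year of the air date, then plot one bar per year.
votes_by_year = toy.groupby(toy["air_date"].dt.year)["total_votes"].mean()
ax = votes_by_year.plot(kind="bar", title="Mean votes per air year")
```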

## Multivariate

Our dataset is quite simple. But we can look at two variables (total_votes, imdb_rating) by a third (season).

In [None]:
df.groupby('season').plot(kind='scatter', y='total_votes', x='imdb_rating')
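The grouped call above produces one figure per season. To compare seasons on a single set of axes, one option is to pass the third variable as the colour of each point via the `c` argument of a scatter plot. This is a sketch on made-up data (`toy` and its values are invented), and the `viridis` colour map is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd

# Made-up rows using the same column names as the lab's dataset.
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3],
    "imdb_rating": [7.6, 8.3, 8.2, 8.7, 8.5, 9.0],
    "total_votes": [3706, 3405, 2983, 2886, 2700, 3100],
})

# A single scatter plot; each point is coloured by its season.
ax = toy.plot(
    kind="scatter",
    x="imdb_rating",
    y="total_votes",
    c="season",
    colormap="viridis",
    title="Votes against rating, coloured by season",
)
```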

There is a lot more you can do when plotting with pandas and Matplotlib. A good resource is the [visualisation section of the pandas documentation](https://pandas.pydata.org/docs/user_guide/visualization.html#basic-plotting-plot).